
union

The union template runs several independent sub-tokenizers over the same input and merges their tokens into one stream. Use it when a column needs to be searchable in more than one way at once — for example as a whole keyword and as character n-grams — without maintaining separate indexes.

Each member is configured with a TOKENIZER⟨N⟩_ prefix, numbered densely from 1: TOKENIZER1_TEMPLATE selects the first sub-tokenizer and its TOKENIZER1_* options configure it, TOKENIZER2_TEMPLATE the second, and so on. At least one member is required. Where pipeline feeds one analyzer's output into the next, union runs them in parallel over the original input and combines the results.
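
The difference between the two compositions can be sketched in a few lines of Python. This is only a conceptual model, not the extension's implementation: `split_words`, `bigrams`, `run_union`, and `run_pipeline` are hypothetical stand-ins introduced here for illustration, and the real templates may handle whitespace differently.

```python
def split_words(text):
    # Stand-in for a word-splitting stage (hypothetical, for illustration only).
    return text.split()


def bigrams(text):
    # Stand-in for a 2-gram stage: every contiguous pair of characters.
    return [text[i:i + 2] for i in range(len(text) - 1)]


def run_union(text, stages):
    # Every stage sees the ORIGINAL input; outputs are merged without duplicates.
    tokens = []
    for stage in stages:
        for tok in stage(text):
            if tok not in tokens:
                tokens.append(tok)
    return tokens


def run_pipeline(text, stages):
    # Each stage sees the tokens produced by the PREVIOUS stage.
    tokens = [text]
    for stage in stages:
        tokens = [out for tok in tokens for out in stage(tok)]
    return tokens


stages = [split_words, bigrams]
print(run_union("abc de", stages))     # ['abc', 'de', 'ab', 'bc', 'c ', ' d']
print(run_pipeline("abc de", stages))  # ['ab', 'bc', 'de']
```

In this model the union output contains both the whole words and bigrams of the full input (including pairs that span the space), while the pipeline output contains only bigrams of the already-split words.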

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| TOKENIZER⟨N⟩_TEMPLATE | string | required | Template of the Nth sub-tokenizer (numbered densely from 1) |
| TOKENIZER⟨N⟩_* | | | Options for the Nth sub-tokenizer, prefixed with TOKENIZER⟨N⟩_ |
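
To make the prefix convention concrete, the following Python sketch groups a flat option map by member number. It is an illustration under stated assumptions, not the extension's parser: `group_member_options` is a hypothetical helper, and treating a gap in the numbering as an error is an assumption (the text only says members are numbered densely from 1).

```python
import re


def group_member_options(options):
    """Group TOKENIZER<N>_* keys by member number N (hypothetical helper)."""
    pattern = re.compile(r"^TOKENIZER(\d+)_(.+)$")
    members = {}
    for key, value in options.items():
        match = pattern.match(key)
        if match is None:
            continue  # not a member option, e.g. 'template'
        number = int(match.group(1))
        members.setdefault(number, {})[match.group(2)] = value

    if not members:
        raise ValueError("union needs at least one member")

    # Assumption: numbering must be 1, 2, ..., with no gaps.
    expected = list(range(1, len(members) + 1))
    if sorted(members) != expected:
        raise ValueError("member numbers must be dense, starting at 1")

    for number in expected:
        if "TEMPLATE" not in members[number]:
            raise ValueError(f"TOKENIZER{number}_TEMPLATE is required")

    # One option dict per member, in member order.
    return [members[number] for number in expected]


print(group_member_options({
    "template": "union",
    "TOKENIZER1_TEMPLATE": "keyword",
    "TOKENIZER2_TEMPLATE": "ngram",
    "TOKENIZER2_MINGRAM": 2,
    "TOKENIZER2_MAXGRAM": 2,
}))
```

Each returned dictionary holds one member's options with the TOKENIZER⟨N⟩_ prefix stripped, which is the shape a sub-tokenizer would presumably receive.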

Tokenization

Every member analyzes the original input, and their outputs are pooled into a single token set. Pairing keyword (which keeps the value verbatim) with a 2-gram ngram member makes abcd searchable both as the exact term and by any of its bigrams. Pairing a delimiter member with keyword indexes hello world both as its individual words and as the whole phrase, so exact-phrase and per-word queries both hit.

| Input | Members | Tokens |
| --- | --- | --- |
| abcd | keyword + ngram (MINGRAM = MAXGRAM = 2) | {abcd,ab,bc,cd} |
| hello world | delimiter (' ') + keyword | {hello,"hello world",world} |
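
The two rows above can be reproduced with a small Python model. The functions `keyword`, `ngram`, `delimiter`, and `union` below are simplified stand-ins written for this example, not the extension's code; they only mirror the behavior described in this section.

```python
def keyword(text):
    # Keeps the value verbatim as a single token.
    return [text]


def ngram(text, n):
    # All contiguous substrings of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def delimiter(text, sep):
    # Splits on the delimiter and drops empty pieces.
    return [part for part in text.split(sep) if part]


def union(text, members):
    # Each member analyzes the original input; results are pooled into one set.
    tokens = set()
    for member in members:
        tokens.update(member(text))
    return tokens


print(sorted(union("abcd", [keyword, lambda t: ngram(t, 2)])))
# ['ab', 'abcd', 'bc', 'cd']
print(sorted(union("hello world", [lambda t: delimiter(t, " "), keyword])))
# ['hello', 'hello world', 'world']
```

The output is sorted here only to make it deterministic; the model says nothing about the order in which the real template emits tokens.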

Index each value both verbatim and as 2-grams:

Query
CREATE TEXT SEARCH DICTIONARY union_dict (
    template = 'union',
    -- member 1 keeps the value verbatim, member 2 emits 2-grams
    TOKENIZER1_TEMPLATE = 'keyword',
    TOKENIZER2_TEMPLATE = 'ngram',
    TOKENIZER2_MINGRAM = 2,
    TOKENIZER2_MAXGRAM = 2
);

SELECT ts_lexize('union_dict', 'abcd');
Result
    ts_lexize
-----------------
 {abcd,ab,bc,cd}

Index text both as individual words and as the whole phrase:

Query
CREATE TEXT SEARCH DICTIONARY union_word_phrase (
    template = 'union',
    -- member 1 splits into words, member 2 keeps the whole phrase
    TOKENIZER1_TEMPLATE = 'delimiter',
    TOKENIZER1_DELIMITER = ' ',
    TOKENIZER2_TEMPLATE = 'keyword'
);

SELECT ts_lexize('union_word_phrase', 'hello world');
Result
          ts_lexize
-----------------------------
 {hello,"hello world",world}
