union

The union template runs several independent sub-tokenizers over the same input and merges their tokens into one stream. Use it when a column needs to be searchable in more than one way at once — for example as a whole keyword and as character n-grams — without maintaining separate indexes.

Each member is configured with a TOKENIZER⟨N⟩_ prefix, numbered densely from 1 (1, 2, 3, and so on, with no gaps): TOKENIZER1_TEMPLATE selects the first sub-tokenizer and its TOKENIZER1_* options configure it, TOKENIZER2_TEMPLATE selects the second, and so on. At least one member is required. Unlike pipeline, which feeds one analyzer's output into the next, union runs its members in parallel over the original input and combines the results.
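The difference between the two compositions can be sketched in Python. This is a conceptual model only, not the real implementation: treating an analyzer as a function from a string to a list of tokens is an assumption, and the names `run_pipeline`, `run_union`, `lowercase`, and `split_on_space` are hypothetical helpers invented for the illustration.

```python
from typing import Callable, List, Set

# Assumption: an analyzer maps one input string to a list of tokens.
Analyzer = Callable[[str], List[str]]


def lowercase(text: str) -> List[str]:
    """Hypothetical analyzer: emit the input as one lowercased token."""
    return [text.lower()]


def split_on_space(text: str) -> List[str]:
    """Hypothetical analyzer: split on single spaces, dropping empty parts."""
    return [part for part in text.split(" ") if part]


def run_pipeline(stages: List[Analyzer], text: str) -> Set[str]:
    """Sequential composition: each stage consumes the previous stage's tokens."""
    tokens = [text]
    for stage in stages:
        tokens = [out for token in tokens for out in stage(token)]
    return set(tokens)


def run_union(members: List[Analyzer], text: str) -> Set[str]:
    """Parallel composition: every member sees the original input."""
    if not members:
        raise ValueError("union requires at least one member")
    pooled: Set[str] = set()
    for member in members:
        pooled.update(member(text))
    return pooled
```

With the same two analyzers and the input `Hello World`, the sequential form yields only the lowercased words, while the parallel form yields the lowercased whole string plus the original-case words, because the second member never sees the first member's output.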

Options

Option	Type	Default	Description
`TOKENIZER⟨N⟩_TEMPLATE`	string	required	Template of the Nth sub-tokenizer (numbered densely from 1)
`TOKENIZER⟨N⟩_*`	—	—	Options for the Nth sub-tokenizer, prefixed with `TOKENIZER⟨N⟩_`
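As a rough illustration of the prefix convention, the sketch below groups a flat option map into one option set per member and checks the dense numbering. It is a model for reading the table, not the actual option parser: `group_member_options` is a hypothetical name, and raising an error on a gap or a missing `TEMPLATE` is an assumption about how violations might be handled.

```python
import re
from typing import Dict, List

# Matches TOKENIZER<N>_<OPTION>, capturing the member index and the option name.
MEMBER_OPTION = re.compile(r"^TOKENIZER(\d+)_(.+)$")


def group_member_options(options: Dict[str, object]) -> List[Dict[str, object]]:
    """Group TOKENIZER<N>_* options by member, in member order (hypothetical helper)."""
    members: Dict[int, Dict[str, object]] = {}
    for name, value in options.items():
        match = MEMBER_OPTION.match(name.upper())
        if match is None:
            continue  # not a member option, e.g. the top-level `template`
        index, key = int(match.group(1)), match.group(2)
        members.setdefault(index, {})[key] = value

    if not members:
        raise ValueError("union requires at least one member")

    # Dense numbering: indices must be exactly 1..len(members), with no gaps.
    expected = list(range(1, len(members) + 1))
    if sorted(members) != expected:
        raise ValueError(f"member indices {sorted(members)} are not dense from 1")

    for index in expected:
        if "TEMPLATE" not in members[index]:
            raise ValueError(f"TOKENIZER{index}_TEMPLATE is required")

    return [members[index] for index in expected]
```

For example, the options `TOKENIZER1_TEMPLATE = 'delimiter'`, `TOKENIZER1_DELIMITER = ' '`, and `TOKENIZER2_TEMPLATE = 'keyword'` group into two members, while supplying `TOKENIZER1_TEMPLATE` and `TOKENIZER3_TEMPLATE` without a second member is rejected by this sketch.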

Tokenization

Every member analyzes the original input, and their outputs are pooled into a single token set. Pairing keyword (which keeps the value verbatim) with a 2-gram ngram member makes abcd searchable both as the exact term and by any of its bigrams. Pairing a delimiter member with keyword indexes hello world both as its individual words and as the whole phrase, so exact-phrase and per-word queries both hit.

Input	Members	Tokens
`abcd`	`keyword` + `ngram` (`MINGRAM = MAXGRAM = 2`)	`{abcd,ab,bc,cd}`
`hello world`	`delimiter` (`' '`) + `keyword`	`{hello,"hello world",world}`
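The two rows above can be reproduced with a small Python model of the pooling step. The member functions here (`keyword`, `ngram`, `delimiter`) are simplified stand-ins written for this illustration, and `union_tokens` is a hypothetical name; the real sub-tokenizers may differ in details such as normalization or token ordering.

```python
from typing import Callable, List, Set


def keyword(text: str) -> List[str]:
    """Stand-in for the keyword member: keep the value verbatim."""
    return [text]


def ngram(text: str, n: int) -> List[str]:
    """Stand-in for an ngram member with MINGRAM = MAXGRAM = n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def delimiter(text: str, sep: str) -> List[str]:
    """Stand-in for the delimiter member: split on sep, dropping empty parts."""
    return [part for part in text.split(sep) if part]


def union_tokens(text: str, *members: Callable[[str], List[str]]) -> Set[str]:
    """Run every member over the original text and pool the tokens into one set."""
    pooled: Set[str] = set()
    for member in members:
        pooled.update(member(text))
    return pooled


# Row 1: keyword + 2-gram ngram over "abcd"
row1 = union_tokens("abcd", keyword, lambda t: ngram(t, 2))

# Row 2: delimiter (' ') + keyword over "hello world"
row2 = union_tokens("hello world", lambda t: delimiter(t, " "), keyword)
```

Because the result is a set, a token produced by more than one member appears once in this model.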

Index each value both verbatim and as 2-grams:

Query

CREATE TEXT SEARCH DICTIONARY union_dict (
    template = 'union',
    -- member 1 keeps the value verbatim, member 2 emits 2-grams
    TOKENIZER1_TEMPLATE = 'keyword',
    TOKENIZER2_TEMPLATE = 'ngram',
    TOKENIZER2_MINGRAM = 2,
    TOKENIZER2_MAXGRAM = 2
);
SELECT ts_lexize('union_dict', 'abcd');

Result

 ts_lexize
-----------------
 {abcd,ab,bc,cd}

Index text both as individual words and as the whole phrase:

Query

CREATE TEXT SEARCH DICTIONARY union_word_phrase (
    template = 'union',
    -- member 1 splits into words, member 2 keeps the whole phrase
    TOKENIZER1_TEMPLATE = 'delimiter',
    TOKENIZER1_DELIMITER = ' ',
    TOKENIZER2_TEMPLATE = 'keyword'
);
SELECT ts_lexize('union_word_phrase', 'hello world');

Result

 ts_lexize
-----------------------------
 {hello,"hello world",world}
