pattern
The pattern template tokenizes text with an RE2 regular expression.
It works in two modes selected by the GROUP option. In extract mode (GROUP = 0 for the whole match, or N > 0 for the Nth capture group) every match becomes a token. In split mode (GROUP = -1, the default) the pattern marks the separators and the text between matches becomes the tokens. This makes it useful both for pulling structured tokens out of free text — identifiers, codes, mentions — and for splitting on separators too complex for a fixed delimiter.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| PATTERN | string | required | RE2 regular expression used to match (extract mode) or to mark separators (split mode) |
| GROUP | integer | -1 | What to emit: -1 = split on each match, 0 = the whole match, N > 0 = the Nth capture group |
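
Both options are passed as parameters of CREATE TEXT SEARCH DICTIONARY, in the same form as the worked examples further down this page. As a minimal sketch, the two dictionaries below share one pattern and differ only in GROUP. The names mention_full and mention_name and the sample sentence are made up for illustration, and the results in the comments are what the option descriptions imply, not captured output.

```sql
-- Hypothetical dictionaries: same PATTERN, different GROUP.
CREATE TEXT SEARCH DICTIONARY mention_full (
    template = 'pattern',
    pattern = '@(\w+)',
    group = 0   -- emit the whole match, including the leading @
);

CREATE TEXT SEARCH DICTIONARY mention_name (
    template = 'pattern',
    pattern = '@(\w+)',
    group = 1   -- emit only the first capture group
);

SELECT ts_lexize('mention_full', 'thanks @alice and @bob_42');
-- expected per the option descriptions: {@alice,@bob_42}

SELECT ts_lexize('mention_name', 'thanks @alice and @bob_42');
-- expected per the option descriptions: {alice,bob_42}
```

Keeping the @ out of the capture group lets the pattern anchor on it without it ending up in the emitted token.
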
Tokenization
In split mode the pattern describes the separators between tokens, so the tokens are the gaps. In extract mode the pattern describes the tokens themselves, so anything not matched is dropped — and with GROUP = N only the Nth parenthesized capture group of each match is kept.
The table below shows the same idea from both directions, plus capture-group extraction:
| Mode | PATTERN | GROUP | Input | Tokens |
|---|---|---|---|---|
| split | [-_.] | -1 | SereneDB-2024_v1.2 | SereneDB, 2024, v1, 2 |
| split | \s+ | -1 | alpha beta gamma | alpha, beta, gamma |
| extract | [A-Z][A-Za-z0-9]{2,} | 0 | The Quick Brown fox jumps over Lazy Dog | The, Quick, Brown, Lazy, Dog |
| extract | ([a-zA-Z]+)(\d+) | 2 | abc123def456ghi | 123, 456 |
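
Split mode also covers separators that no single literal delimiter can describe. The sketch below assumes a hypothetical dictionary named list_split and treats a comma or semicolon, together with any whitespace around it, as one separator. The expected result in the comment follows from the split semantics described above and is not captured output.

```sql
-- Hypothetical dictionary: a separator is a comma or semicolon
-- plus any whitespace on either side of it.
CREATE TEXT SEARCH DICTIONARY list_split (
    template = 'pattern',
    pattern = '\s*[,;]\s*',
    group = -1   -- split mode: emit the text between matches
);

SELECT ts_lexize('list_split', 'red, green;blue ; yellow');
-- expected per the split semantics: {red,green,blue,yellow}
```

Because the surrounding whitespace is part of each match, it is consumed with the separator and does not need to be trimmed from the tokens afterwards.
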
Extract every capitalized word (GROUP = 0)
Each whole match becomes a token; the lowercase fox, jumps and over are not matched and so are dropped:

    CREATE TEXT SEARCH DICTIONARY caps (
        template = 'pattern',
        pattern = '[A-Z][A-Za-z0-9]{2,}',
        group = 0
    );

    SELECT ts_lexize('caps', 'The Quick Brown fox jumps over Lazy Dog');

     ts_lexize
    ----------------------------
     {The,Quick,Brown,Lazy,Dog}

Split on runs of whitespace (GROUP = -1)
Here the pattern \s+ marks the separators and the runs of text between them are emitted:

    CREATE TEXT SEARCH DICTIONARY ws_split (
        template = 'pattern',
        pattern = '\s+',
        group = -1
    );

    SELECT ts_lexize('ws_split', 'alpha beta gamma');

     ts_lexize
    --------------------
     {alpha,beta,gamma}

Split an identifier on several delimiters (GROUP = -1)
A character class splits on -, _ or . in a single pass — something a fixed delimiter cannot do:

    CREATE TEXT SEARCH DICTIONARY id_split (
        template = 'pattern',
        pattern = '[-_.]',
        group = -1
    );

    SELECT ts_lexize('id_split', 'SereneDB-2024_v1.2');

     ts_lexize
    ----------------------
     {SereneDB,2024,v1,2}

Keep only a capture group (GROUP = 2)
With GROUP = 2 each match emits just its second capture group — the trailing digits:

    CREATE TEXT SEARCH DICTIONARY trailing_digits (
        template = 'pattern',
        pattern = '([a-zA-Z]+)(\d+)',
        group = 2
    );

    SELECT ts_lexize('trailing_digits', 'abc123def456ghi');

     ts_lexize
    -----------
     {123,456}

See also
- delimiter / multi_delimiter — split on literal characters
- segmentation — Unicode word-boundary splitting
- CREATE TEXT SEARCH DICTIONARY