
pattern

The pattern template tokenizes text with an RE2 regular expression.

It works in two modes selected by the GROUP option. In extract mode (GROUP = 0 for the whole match, or N > 0 for the Nth capture group) every match becomes a token. In split mode (GROUP = -1, the default) the pattern marks the separators and the text between matches becomes the tokens. This makes it useful both for pulling structured tokens out of free text — identifiers, codes, mentions — and for splitting on separators too complex for a fixed delimiter.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| PATTERN | string | required | RE2 regular expression used to match (extract mode) or to mark separators (split mode) |
| GROUP | integer | -1 | What to emit: -1 = split on each match, 0 = the whole match, N > 0 = the Nth capture group |

Tokenization

In split mode the pattern describes the separators between tokens, so the tokens are the gaps. In extract mode the pattern describes the tokens themselves, so anything not matched is dropped — and with GROUP = N only the Nth parenthesized capture group of each match is kept.
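
The two modes can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the template's implementation: `pattern_tokenize` is a hypothetical helper, Python's `re` module stands in for RE2 (the two differ in places, for example RE2's `\d` is ASCII-only), and dropping empty tokens in split mode is an assumption that this page does not confirm.

```python
import re


def pattern_tokenize(pattern: str, text: str, group: int = -1) -> list[str]:
    """Illustrative emulation of the split and extract modes (hypothetical helper)."""
    regex = re.compile(pattern)

    if group == -1:
        # Split mode: each match is a separator, the gaps are the tokens.
        # finditer is used instead of re.split so that capture groups in
        # the pattern never leak into the output.
        tokens = []
        start = 0
        for match in regex.finditer(text):
            tokens.append(text[start:match.start()])
            start = match.end()
        tokens.append(text[start:])
        # Assumption: empty gaps (e.g. a leading separator) are dropped.
        return [t for t in tokens if t]

    # Extract mode: group 0 is the whole match, N > 0 the Nth capture group.
    # Groups that did not participate in a match (None) or are empty are skipped.
    return [m.group(group) for m in regex.finditer(text) if m.group(group)]


print(pattern_tokenize(r"[-_.]", "SereneDB-2024_v1.2"))
print(pattern_tokenize(r"([a-zA-Z]+)(\d+)", "abc123def456ghi", group=2))
```

Under these assumptions the first call yields the four pieces of the identifier and the second yields only the digit runs.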

The table below shows the same idea from both directions, plus capture-group extraction:

| Mode | PATTERN | GROUP | Input | Tokens |
|---|---|---|---|---|
| split | [-_.] | -1 | SereneDB-2024_v1.2 | SereneDB, 2024, v1, 2 |
| split | \s+ | -1 | alpha beta gamma | alpha, beta, gamma |
| extract | [A-Z][A-Za-z0-9]{2,} | 0 | The Quick Brown fox jumps over Lazy Dog | The, Quick, Brown, Lazy, Dog |
| extract | ([a-zA-Z]+)(\d+) | 2 | abc123def456ghi | 123, 456 |

Extract every capitalized word (GROUP = 0)

Each whole match becomes a token; the lowercase fox, jumps and over are not matched and so are dropped:

Query
CREATE TEXT SEARCH DICTIONARY caps (
    template = 'pattern',
    pattern = '[A-Z][A-Za-z0-9]{2,}',
    group = 0
);
SELECT ts_lexize('caps', 'The Quick Brown fox jumps over Lazy Dog');
Result
 ts_lexize
----------------------------
 {The,Quick,Brown,Lazy,Dog}

Split on runs of whitespace (GROUP = -1)

Here the pattern \s+ marks the separators and the runs of text between them are emitted:

Query
CREATE TEXT SEARCH DICTIONARY ws_split (
    template = 'pattern',
    pattern = '\s+',
    group = -1
);
SELECT ts_lexize('ws_split', 'alpha  beta   gamma');
Result
 ts_lexize
--------------------
 {alpha,beta,gamma}

Split an identifier on several delimiters (GROUP = -1)

A character class splits on -, _ or . in a single pass — something a fixed delimiter cannot do:

Query
CREATE TEXT SEARCH DICTIONARY id_split (
    template = 'pattern',
    pattern = '[-_.]',
    group = -1
);
SELECT ts_lexize('id_split', 'SereneDB-2024_v1.2');
Result
 ts_lexize
----------------------
 {SereneDB,2024,v1,2}
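
To make the contrast with a fixed delimiter concrete, the following sketch compares a plain string split against a character-class split on the same identifier. It uses Python's `str.split` and `re.split` as stand-ins and says nothing about how the template itself is implemented.

```python
import re

identifier = "SereneDB-2024_v1.2"

# A fixed delimiter recognizes only one separator, so '_' and '.' survive.
by_hyphen = identifier.split("-")

# A character class treats '-', '_' and '.' as separators in one pass.
by_class = re.split(r"[-_.]", identifier)

print(by_hyphen)  # ['SereneDB', '2024_v1.2']
print(by_class)   # ['SereneDB', '2024', 'v1', '2']
```

Getting the same four tokens with fixed delimiters would take three successive splits, one per separator.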

Keep only a capture group (GROUP = 2)

With GROUP = 2 each match emits just its second capture group — the trailing digits:

Query
CREATE TEXT SEARCH DICTIONARY trailing_digits (
    template = 'pattern',
    pattern = '([a-zA-Z]+)(\d+)',
    group = 2
);
SELECT ts_lexize('trailing_digits', 'abc123def456ghi');
Result
 ts_lexize
-----------
 {123,456}
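
For comparison, this sketch shows what each GROUP value would select for the same pattern and input. It again uses Python's `re` as a stand-in for RE2, and `extract` is a hypothetical helper, so treat it as an illustration of the group numbering rather than of the template's behavior.

```python
import re

pattern = re.compile(r"([a-zA-Z]+)(\d+)")
text = "abc123def456ghi"


def extract(group: int) -> list[str]:
    # One token per match: group 0 is the whole match,
    # group 1 the letters, group 2 the digits.
    return [m.group(group) for m in pattern.finditer(text)]


print(extract(0))  # ['abc123', 'def456']
print(extract(1))  # ['abc', 'def']
print(extract(2))  # ['123', '456']
```

The trailing `ghi` is not followed by digits, so the pattern never matches it and it yields no token for any group value.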

See also