pattern

The pattern template tokenizes text with an RE2 regular expression.

It works in two modes selected by the `GROUP` option. In extract mode (`GROUP = 0` for the whole match, or `N > 0` for the Nth capture group) every match becomes a token. In split mode (`GROUP = -1`, the default) the pattern marks the separators and the text between matches becomes the tokens. This makes it useful both for pulling structured tokens (identifiers, codes, mentions) out of free text and for splitting on separators too complex for a fixed delimiter.

Options

Option	Type	Default	Description
`PATTERN`	string	required	RE2 regular expression used to match (extract mode) or to mark separators (split mode)
`GROUP`	integer	`-1`	What to emit: `-1` = split on each match, `0` = the whole match, `N > 0` = the Nth capture group

Tokenization

In split mode the pattern describes the separators between tokens, so the tokens are the gaps. In extract mode the pattern describes the tokens themselves, so anything not matched is dropped; with `GROUP = N`, only the Nth parenthesized capture group of each match is kept.

The table below shows the same idea from both directions, plus capture-group extraction:

Mode	`PATTERN`	`GROUP`	Input	Tokens
split	`[-_.]`	`-1`	`SereneDB-2024_v1.2`	`SereneDB`, `2024`, `v1`, `2`
split	`\s+`	`-1`	`alpha beta gamma`	`alpha`, `beta`, `gamma`
extract	`[A-Z][A-Za-z0-9]{2,}`	`0`	`The Quick Brown fox jumps over Lazy Dog`	`The`, `Quick`, `Brown`, `Lazy`, `Dog`
extract	`([a-zA-Z]+)(\d+)`	`2`	`abc123def456ghi`	`123`, `456`
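
The rows above can be reproduced outside the database with a small emulation. This is a sketch under stated assumptions, not the template's implementation: `pattern_tokenize` is a hypothetical helper, and Python's `re` module stands in for RE2 (the two agree on the simple patterns used here, but not in general). Whether the real template drops empty tokens in split mode is not stated on this page; the sketch drops them.

```python
import re

def pattern_tokenize(pattern: str, text: str, group: int = -1) -> list:
    """Hypothetical emulation of the two modes (not the real implementation)."""
    rx = re.compile(pattern)
    if group == -1:
        # Split mode: each match is a separator, the gaps between them are tokens.
        # Matches are walked manually because re.split would also return
        # the contents of any capture groups in the pattern.
        tokens, pos = [], 0
        for m in rx.finditer(text):
            tokens.append(text[pos:m.start()])
            pos = m.end()
        tokens.append(text[pos:])
        # Assumption: empty gaps are dropped.
        return [t for t in tokens if t]
    # Extract mode: group 0 is the whole match, N > 0 the Nth capture group.
    # A group that did not participate in a match yields None and is skipped.
    return [m.group(group) for m in rx.finditer(text)
            if m.group(group) is not None]

print(pattern_tokenize(r'[-_.]', 'SereneDB-2024_v1.2'))
print(pattern_tokenize(r'([a-zA-Z]+)(\d+)', 'abc123def456ghi', group=2))
```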

Extract every capitalized word (`GROUP = 0`)

Each whole match becomes a token; the lowercase `fox`, `jumps` and `over` are not matched and so are dropped:

Query

CREATE TEXT SEARCH DICTIONARY caps (
    template = 'pattern',
    pattern = '[A-Z][A-Za-z0-9]{2,}',
    group = 0
);
SELECT ts_lexize('caps', 'The Quick Brown fox jumps over Lazy Dog');

Result

 ts_lexize
----------------------------
 {The,Quick,Brown,Lazy,Dog}
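
Two properties of this pattern are easy to overlook: `{2,}` requires at least two characters after the capital letter, and nothing anchors a match to the start of a word. The sketch below checks both, using Python's `re` module as a stand-in for RE2 (an assumption); the sample strings are illustrative and do not come from this page.

```python
import re

caps = re.compile(r'[A-Z][A-Za-z0-9]{2,}')

# "An" and "Ok" have only one character after the capital, so they never match.
short_words = caps.findall('An Ok Result')

# No word boundary in the pattern: a match can begin in the middle of a word.
mid_word = caps.findall('iPhone and eBay')

print(short_words)  # ['Result']
print(mid_word)     # ['Phone', 'Bay']
```

If mid-word matches are unwanted, prefixing the pattern with `\b` is one option to consider.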

Split on runs of whitespace (`GROUP = -1`)

Here the pattern `\s+` marks the separators and the runs of text between them are emitted:

Query

CREATE TEXT SEARCH DICTIONARY ws_split (
    template = 'pattern',
    pattern = '\s+',
    group = -1
);
SELECT ts_lexize('ws_split', 'alpha  beta   gamma');

Result

 ts_lexize
--------------------
 {alpha,beta,gamma}
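
The `+` matters here: it makes a whole run of whitespace count as a single separator. The comparison below uses Python's `re.split` as a stand-in for RE2 (an assumption). It shows Python's behavior only; whether the pattern template emits or drops empty tokens is not stated on this page.

```python
import re

text = 'alpha  beta   gamma'

# One separator per run of whitespace: no empty gaps between words.
runs = re.split(r'\s+', text)

# One separator per whitespace character: adjacent separators leave empty gaps.
singles = re.split(r'\s', text)

print(runs)     # ['alpha', 'beta', 'gamma']
print(singles)  # ['alpha', '', 'beta', '', '', 'gamma']
```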

Split an identifier on several delimiters (`GROUP = -1`)

A character class splits on `-`, `_` or `.` in a single pass, something a fixed delimiter cannot do:

Query

CREATE TEXT SEARCH DICTIONARY id_split (
    template = 'pattern',
    pattern = '[-_.]',
    group = -1
);
SELECT ts_lexize('id_split', 'SereneDB-2024_v1.2');

Result

 ts_lexize
----------------------
 {SereneDB,2024,v1,2}
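
To make the contrast with a fixed delimiter concrete, the sketch below splits the same identifier both ways. It uses Python's `str.split` and `re.split` as stand-ins (an assumption; the template itself uses RE2).

```python
import re

identifier = 'SereneDB-2024_v1.2'

# A fixed delimiter handles only one separator character at a time.
by_hyphen = identifier.split('-')

# A character class treats '-', '_' and '.' as equivalent separators.
by_class = re.split(r'[-_.]', identifier)

print(by_hyphen)  # ['SereneDB', '2024_v1.2']
print(by_class)   # ['SereneDB', '2024', 'v1', '2']
```

Inside the brackets `.` is a literal dot rather than a wildcard, and `-` is literal because it comes first in the class.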

Keep only a capture group (`GROUP = 2`)

With `GROUP = 2` each match emits just its second capture group, the trailing digits:

Query

CREATE TEXT SEARCH DICTIONARY trailing_digits (
    template = 'pattern',
    pattern = '([a-zA-Z]+)(\d+)',
    group = 2
);
SELECT ts_lexize('trailing_digits', 'abc123def456ghi');

Result

 ts_lexize
-----------
 {123,456}
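
For comparison, the sketch below lists what each group number would select for the same pattern and input. Python's `re` module stands in for RE2 here (an assumption), so it illustrates the regular-expression semantics rather than the template's own output.

```python
import re

rx = re.compile(r'([a-zA-Z]+)(\d+)')
text = 'abc123def456ghi'

whole = [m.group(0) for m in rx.finditer(text)]   # corresponds to GROUP = 0
first = [m.group(1) for m in rx.finditer(text)]   # corresponds to GROUP = 1
second = [m.group(2) for m in rx.finditer(text)]  # corresponds to GROUP = 2

print(whole)   # ['abc123', 'def456']
print(first)   # ['abc', 'def']
print(second)  # ['123', '456']
```

The trailing `ghi` appears in none of the lists: no digits follow it, so the pattern as a whole does not match there.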
