segmentation
The segmentation template splits text into tokens using the language-agnostic word-boundary algorithm defined by Unicode Standard Annex #29 (Unicode Text Segmentation).
It derives boundaries from the Unicode properties of the characters themselves rather than from whitespace or a per-language lexicon, so a single text search dictionary works across scripts. That makes it the right choice for languages that do not separate words with spaces (such as Chinese, Japanese and Thai), where a plain delimiter split would treat a whole sentence as a single token. For ASCII text where words are already space-separated, a delimiter split is simpler and faster.
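To make the delimiter-split limitation concrete, here is a minimal Python sketch. The helper `delimiter_split` and the sample strings are illustrative assumptions, not part of the template. It only shows why splitting on spaces fails for text written without them.

```python
def delimiter_split(text, delimiter=" "):
    """Split on a literal delimiter and drop empty pieces (hypothetical helper)."""
    return [piece for piece in text.split(delimiter) if piece]


english = "the quick fox"
japanese = "私は猫が好きです"  # sample sentence written without spaces

print(delimiter_split(english))   # ['the', 'quick', 'fox']
print(delimiter_split(japanese))  # ['私は猫が好きです']
```

The space-separated sentence yields one token per word, while the sentence without spaces comes back as a single token. The segmentation template is meant to cover that second case.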
Options
| Option | Type | Default | Description |
|---|---|---|---|
| CASE | string | 'none' | Case conversion applied to each token: 'none', 'lower', 'upper' |
| BREAK | string | 'alpha' | Which boundaries produce tokens: 'alpha' (alphabetic and numeric runs only), 'graphic' (visible characters including punctuation), 'all' (a token at every boundary, including whitespace) |
Tokenization
The BREAK mode decides which Unicode word boundaries are kept. alpha (the default) emits only the runs of letters and digits and discards punctuation and whitespace entirely. graphic additionally keeps each visible punctuation mark as its own token. all emits every segment, including the whitespace runs between words. CASE is then applied to each emitted token.
The following table shows how the input 'The Quick fox-trot.' is tokenized under each BREAK mode:
| BREAK | Tokens |
|---|---|
| alpha | The, Quick, fox, trot |
| graphic | The, Quick, fox, -, trot, . |
| all | The, " ", Quick, " ", fox, -, trot, . |
Because boundaries come from Unicode properties, the same dictionary also segments scripts that do not use spaces, while space-separated ASCII text is split at whitespace and punctuation, as in the table above.
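The way BREAK filters segments and CASE then converts them can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the template's implementation. The function `segment` and the regular expression are hypothetical, and the regex is only a crude stand-in for UAX #29 that behaves sensibly on simple space-separated input.

```python
import re

# Crude stand-in for UAX #29 word boundaries (an assumption of this sketch):
# a run of letters/digits, a run of whitespace, or any single other character.
_SEGMENT = re.compile(r"[^\W_]+|\s+|.", re.DOTALL)

_CASE = {"none": str, "lower": str.lower, "upper": str.upper}


def segment(text, break_mode="alpha", case="none"):
    """Return the tokens of `text`, filtered by break_mode, then case-converted."""
    if break_mode not in ("alpha", "graphic", "all"):
        raise ValueError(f"unknown break mode: {break_mode!r}")
    if case not in _CASE:
        raise ValueError(f"unknown case: {case!r}")
    convert = _CASE[case]
    tokens = []
    for seg in _SEGMENT.findall(text):
        if break_mode == "alpha" and not seg[0].isalnum():
            continue  # alpha: drop punctuation and whitespace
        if break_mode == "graphic" and seg.isspace():
            continue  # graphic: keep punctuation, drop whitespace
        tokens.append(convert(seg))  # CASE is applied after filtering
    return tokens


print(segment("The Quick fox-trot.", "alpha", "lower"))
# ['the', 'quick', 'fox', 'trot']
print(segment("The Quick fox-trot.", "graphic"))
# ['The', 'Quick', 'fox', '-', 'trot', '.']
print(segment("The Quick fox-trot.", "all", "upper"))
# ['THE', ' ', 'QUICK', ' ', 'FOX', '-', 'TROT', '.']
```

The regex keeps any contiguous run of letters together, so it does not reproduce the handling of scripts written without spaces. It only mirrors the filtering and case steps described above.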
This dictionary lowercases each alphabetic/numeric run:
CREATE TEXT SEARCH DICTIONARY seg_dict (
    template = 'segmentation',
    case = 'lower',
    BREAK = 'alpha'
);

SELECT ts_lexize('seg_dict', 'The Quick fox-trot.');
      ts_lexize
----------------------
 {the,quick,fox,trot}

Graphic boundaries
BREAK = 'graphic' keeps punctuation as separate tokens:
CREATE TEXT SEARCH DICTIONARY seg_graphic (
    template = 'segmentation',
    BREAK = 'graphic'
);

SELECT ts_lexize('seg_graphic', 'The Quick fox-trot.');
        ts_lexize
--------------------------
 {The,Quick,fox,-,trot,.}

All boundaries
BREAK = 'all' emits a token at every boundary, whitespace included. This example also sets case = 'upper', so each emitted token is uppercased:
CREATE TEXT SEARCH DICTIONARY seg_all (
    template = 'segmentation',
    case = 'upper',
    BREAK = 'all'
);

SELECT ts_lexize('seg_all', 'The Quick fox-trot.');
            ts_lexize
----------------------------------
 {THE," ",QUICK," ",FOX,-,TROT,.}

See also
- delimiter — split space-separated text on a literal character
- text — full linguistic pipeline (case folding, stemming, stopwords)
- CREATE TEXT SEARCH DICTIONARY