
text

The text template is the general-purpose word tokenizer and the one to reach for first for natural-language search. It splits input into words on Unicode boundaries and then, under the control of an ICU LOCALE, optionally folds case, strips accent marks, applies Snowball stemming and removes stop words.

Stemming maps inflected forms to a common root (running and runs both index as run), so a query matches a document even when the surface forms differ. Snowball is a rule-based suffix stripper, so irregular forms such as ran are not reduced to run. Because the same dictionary analyzes both the indexed text and the query, the search term is reduced the same way as the indexed words. Stop words can be supplied inline with STOPWORDS or loaded from a file with STOPWORDSPATH, and accent folding (ACCENT = false) lets café match cafe.
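
The effect of accent folding can be modeled in a few lines of Python. This is only an illustrative sketch, not the engine's implementation: the function names `fold_accents` and `analyze` are hypothetical, lowercasing stands in for CASE = 'lower', and whitespace splitting stands in for Unicode word-boundary splitting.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Strip combining accent marks, e.g. 'café' -> 'cafe'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn) and keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def analyze(text: str, accent: bool) -> list[str]:
    # accent=True preserves marks and accent=False folds them,
    # mirroring the meaning of the ACCENT option (assumed model).
    words = text.lower().split()
    return words if accent else [fold_accents(w) for w in words]

# With folding, the indexed word and the query reduce to the same token.
print(analyze("Café", accent=False) == analyze("cafe", accent=False))  # True
# With accents preserved, they stay distinct.
print(analyze("Café", accent=True) == analyze("cafe", accent=True))    # False
```

The same function runs over both the document and the query, which is why the two sides line up once folding is enabled.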

Enable the FREQUENCY and POSITION feature flags on the indexed column when you need relevance ranking or phrase and proximity search, respectively.
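
To see why these two flags matter, consider a simplified model of what an index can record per term. The sketch below is an assumption for illustration only (the source does not describe the engine's storage format, and `build_postings` and `has_phrase` are hypothetical names): term counts are the kind of data relevance ranking draws on, and term positions are what make it possible to check that words are adjacent.

```python
from collections import defaultdict

def build_postings(tokens: list[str]) -> dict[str, list[int]]:
    """Map each token to the list of positions where it occurs."""
    postings: dict[str, list[int]] = defaultdict(list)
    for position, token in enumerate(tokens):
        postings[token].append(position)
    return dict(postings)

def has_phrase(postings: dict[str, list[int]], phrase: list[str]) -> bool:
    """True if the phrase tokens occur at consecutive positions."""
    if not phrase:
        return False
    starts = postings.get(phrase[0], [])
    return any(
        all(start + offset in postings.get(token, [])
            for offset, token in enumerate(phrase))
        for start in starts
    )

tokens = ["run", "fast", "run", "far"]
postings = build_postings(tokens)

# Frequency: how often each term occurs (input to ranking).
frequency = {token: len(positions) for token, positions in postings.items()}
print(frequency["run"])                        # 2

# Position: needed to tell adjacent words from merely co-occurring ones.
print(has_phrase(postings, ["run", "far"]))    # True
print(has_phrase(postings, ["fast", "far"]))   # False
```

Without positions, the last two queries would be indistinguishable, since both pairs of words occur somewhere in the document.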

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| LOCALE | string | | ICU locale (e.g., 'en_US.UTF-8', 'fr', 'de') |
| CASE | string | 'none' | Case conversion: 'none', 'lower', 'upper' |
| STEMMING | boolean | true | Apply word stemming |
| ACCENT | boolean | true | Preserve accent marks |
| STOPWORDS | string list | | Inline stop words (e.g., '"the","a","an"') |
| STOPWORDSPATH | string | | Path to a stopwords file |
| MINGRAM | integer | 2 | Edge n-gram minimum length |
| MAXGRAM | integer | 3 | Edge n-gram maximum length |
| PRESERVEORIGINAL | boolean | false | Emit original token alongside n-grams |

Tokenization

The text template splits input on Unicode word boundaries, then applies the normalization steps you enable: case folding (CASE), accent folding (ACCENT = false), Snowball stemming (STEMMING) and stop-word removal (STOPWORDS or STOPWORDSPATH). With stemming on, inflected forms collapse to a shared root, so a query matches a document even when the surface forms differ. Setting MINGRAM/MAXGRAM adds edge n-grams, prefix-anchored fragments of each word, which is what powers as-you-type autocomplete.

| Input | Options | Tokens |
| --- | --- | --- |
| The runners were running quickly | CASE = 'lower', STEMMING = true | {the,runner,were,run,quick} |
| The Runners Café | CASE = 'none', STEMMING = false, ACCENT = true | {The,Runners,Café} |
| The cat is a hunter | CASE = 'lower', STEMMING = true, STOPWORDS = '"the","a","an","is"' | {cat,hunter} |
| Search | CASE = 'lower', MINGRAM = 2, MAXGRAM = 4, PRESERVEORIGINAL = true | {se,sea,sear,search} |

Stemming reduces runners to runner, running to run and quickly to quick. Note that runner and run remain distinct stems, so a query for run would not match runners here. Stop words are removed only when STOPWORDS or STOPWORDSPATH is set; by default common words like the are kept. Use ts_lexize to preview the exact token stream for any configuration:

Query
CREATE TEXT SEARCH DICTIONARY tok_text_stem (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = true,
    accent = true
);
SELECT ts_lexize('tok_text_stem', 'The runners were running quickly');
Result
 ts_lexize
-----------------------------
 {the,runner,were,run,quick}

With CASE = 'none' and STEMMING = false the words keep their original form and casing, and accent marks survive because ACCENT = true:

Query
CREATE TEXT SEARCH DICTIONARY tok_text_exact (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'none',
    stemming = false,
    accent = true
);
SELECT ts_lexize('tok_text_exact', 'The Runners Café');
Result
 ts_lexize
--------------------
 {The,Runners,Café}

Supplying STOPWORDS drops the listed words from the stream:

Query
CREATE TEXT SEARCH DICTIONARY tok_text_stop (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = true,
    stopwords = '"the","a","an","is"'
);
SELECT ts_lexize('tok_text_stop', 'The cat is a hunter');
Result
 ts_lexize
--------------
 {cat,hunter}
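
The filtering step above can be approximated in Python as follows. This is a sketch under stated assumptions rather than the engine's logic: `parse_stopwords` and `remove_stopwords` are hypothetical helpers, the inline list is assumed to be comma-separated double-quoted words, a simple `\w+` regex stands in for Unicode word-boundary splitting, and stemming is left out because cat and hunter pass through it unchanged in the result shown.

```python
import csv
import re

def parse_stopwords(option: str) -> set[str]:
    """Parse an inline list such as '"the","a","an","is"' into a set of words."""
    # Assumption: the option value is a comma-separated list of quoted words.
    row = next(csv.reader([option]))
    return {word.strip().lower() for word in row}

def remove_stopwords(text: str, stopwords: set[str]) -> list[str]:
    # Lowercase first (CASE = 'lower'), split into words, then drop listed words.
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stopwords]

stopwords = parse_stopwords('"the","a","an","is"')
print(remove_stopwords("The cat is a hunter", stopwords))  # ['cat', 'hunter']

# With an empty stop list nothing is removed, matching the default behavior.
print(remove_stopwords("The cat is a hunter", set()))
# ['the', 'cat', 'is', 'a', 'hunter']
```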

Setting MINGRAM/MAXGRAM emits prefix-anchored edge n-grams of each word, so a partial query like sea matches Search — the basis for autocomplete:

Query
CREATE TEXT SEARCH DICTIONARY tok_text_edge (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    mingram = 2,
    maxgram = 4,
    preserveoriginal = true
);
SELECT ts_lexize('tok_text_edge', 'Search');
Result
 ts_lexize
----------------------
 {se,sea,sear,search}
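
Edge n-gram generation itself is simple enough to sketch in Python. The function below is a hypothetical model that reproduces the result above. How the engine treats words shorter than MINGRAM is not stated in the source, so the behavior here (no n-grams, original kept only when requested) is an assumption.

```python
def edge_ngrams(word: str, min_gram: int, max_gram: int,
                preserve_original: bool = False) -> list[str]:
    """Return the prefixes of `word` with lengths min_gram..max_gram."""
    # Never ask for a prefix longer than the word itself.
    upper = min(max_gram, len(word))
    grams = [word[:n] for n in range(min_gram, upper + 1)]
    # Optionally keep the full word so exact matches still work.
    if preserve_original and word not in grams:
        grams.append(word)
    return grams

print(edge_ngrams("search", 2, 4, preserve_original=True))
# ['se', 'sea', 'sear', 'search']
print(edge_ngrams("search", 2, 4))
# ['se', 'sea', 'sear']

# A partial query such as "sea" matches because it is one of the indexed prefixes.
print("sea" in edge_ngrams("search", 2, 4, preserve_original=True))  # True
```

A larger MAXGRAM lets longer partial queries match at the cost of more tokens per word.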

Examples

Basic English dictionary

Query
CREATE TEXT SEARCH DICTIONARY english_dict (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = true,
    accent = true,
    frequency = true,
    position = true
);

No stemming, case-sensitive

Query
CREATE TEXT SEARCH DICTIONARY exact_dict (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'none',
    stemming = false,
    accent = false,
    frequency = true,
    position = true
);

With edge n-grams for autocomplete

Query
CREATE TEXT SEARCH DICTIONARY autocomplete_dict (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    mingram = 2,
    maxgram = 5,
    preserveoriginal = true
);

With inline stopwords

Query
CREATE TEXT SEARCH DICTIONARY filtered_dict (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = true,
    stopwords = '"the","a","an","is","at"'
);
