Text Analysis

Text analysis turns the free-form text of a column into the sequence of tokens that an inverted index actually stores and searches. Getting analysis right is what lets a search for `running shoes` find a document that says `Run faster in our Shoes!`: the surface forms differ, but with lowercasing and stemming both sides reduce to the shared tokens `run` and `shoe`.

Analysis in SereneDB is configured entirely through a text search dictionary attached to a column. A dictionary is assembled from templates: some templates tokenize (split text into tokens), others normalize (lowercase, fold accents, stem, drop stop words), and the `pipeline` template composes them. There is no separate "token filter" object; every stage is a template.

The same analysis at index time and query time

The single most important rule: a column's dictionary is applied both when the column is indexed and when a query runs against it. The data and the query pass through the identical pipeline, so their tokens line up.

Without analysis, a literal comparison of the query `FOX` against the stored text `Quick BROWN Fox` would not match. After analysis the query becomes the token `fox`, which is also one of the tokens stored for the text, and the lookup succeeds. You can preview exactly how a dictionary tokenizes any string with `ts_lexize`; use it whenever you are tuning a dictionary:

Query

CREATE TEXT SEARCH DICTIONARY ta_text (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = false,
    accent = false
);
SELECT ts_lexize('ta_text', 'Quick BROWN Fox');

Result

 ts_lexize
-------------------
 {quick,brown,fox}
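The symmetry rule can also be modeled outside the database. The Python sketch below is a toy model, not SereneDB's implementation: `analyze` is a hypothetical stand-in for a lowercasing `text` dictionary, and the index is a plain mapping from token to row ids. The point is only that the index side and the query side call the same function.

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for a lowercasing 'text' dictionary."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def build_index(rows):
    """Map each analyzed token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in analyze(text):
            index[token].add(row_id)
    return index


def search(index, query):
    """Analyze the query with the same function used at index time."""
    token_sets = [index.get(tok, set()) for tok in analyze(query)]
    return set.intersection(*token_sets) if token_sets else set()


rows = {1: "Quick BROWN Fox", 2: "Lazy dog"}
index = build_index(rows)
print(search(index, "FOX"))  # {1}
```

If `search` skipped the `analyze` call, the lookup for `FOX` would miss, because the index only ever stored `fox`.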

Overriding the query-time analyzer

Symmetry is the default, not a hard rule. To analyze a query string with a different dictionary than the column's (the counterpart of Elasticsearch's `search_analyzer`), wrap it in `ts_tokenize(text, 'dict')` or the `'text'::tokenize('dict')` cast. A common use is forcing exact matching against an otherwise-stemmed column with `'…'::tokenize('keyword')`.
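As a rough illustration of what an asymmetric query analyzer does, the Python sketch below models a column indexed with a lowercasing analyzer and a query that is optionally analyzed verbatim. Both analyzers and the `matches` helper are hypothetical stand-ins for the idea, not SereneDB functions.

```python
def column_analyzer(text):
    """Hypothetical index-time analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def keyword_analyzer(text):
    """Hypothetical query-time override: the whole string is one token."""
    return [text]


def matches(stored_text, query, query_analyzer=column_analyzer):
    """True when every query token is among the stored tokens."""
    stored = set(column_analyzer(stored_text))
    return all(tok in stored for tok in query_analyzer(query))


print(matches("Quick BROWN Fox", "FOX"))                    # True
print(matches("Quick BROWN Fox", "FOX", keyword_analyzer))  # False
print(matches("Quick BROWN Fox", "fox", keyword_analyzer))  # True
```

With the override in place, the query term must equal a stored token exactly, which is why it is useful for bypassing normalization on the query side.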

Tokenizing templates

The tokenizing template decides how text is split. The most common is `text`, which splits on word boundaries (shown above). Others target specific needs:

A verbatim column (the `keyword` template, or simply a column with no dictionary) emits the whole value as a single token, giving exact, case-sensitive matching for ids, codes and categories:

Query

CREATE TEXT SEARCH DICTIONARY ta_keyword (
    template = 'keyword'
);
SELECT ts_lexize('ta_keyword', 'Hello World');

Result

 ts_lexize
-----------------
 {"Hello World"}

The `ngram` template emits overlapping character n-grams, which power substring and fuzzy matching:

Query

CREATE TEXT SEARCH DICTIONARY ta_ngram (
    template = 'ngram',
    mingram = 2,
    maxgram = 3
);
SELECT ts_lexize('ta_ngram', 'cat');

Result

 ts_lexize
-------------
 {ca,cat,at}
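To see where those three tokens come from, here is a minimal Python sketch of character n-gram generation. It is a model of the idea only: the emission order (by start offset, then by length) is an assumption chosen to match the result shown above, and the real template may treat Unicode or word boundaries differently.

```python
def ngrams(text, mingram, maxgram):
    """Return every substring of length mingram..maxgram, by start offset."""
    grams = []
    for start in range(len(text)):
        for size in range(mingram, maxgram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(ngrams("cat", 2, 3))  # ['ca', 'cat', 'at']

# Substring matching falls out of set overlap: every n-gram of "cat"
# also occurs among the n-grams of a longer string that contains it.
print(set(ngrams("cat", 2, 3)) <= set(ngrams("concatenate", 2, 3)))  # True
```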

Further tokenizing templates (`sparse_ngram`, `delimiter` / `multi_delimiter`, `segmentation`, `pattern`, `path_hierarchy`, `wildcard`) are listed in the CREATE TEXT SEARCH DICTIONARY reference.

Normalization

Normalization rewrites tokens so that equivalent forms collapse together. The `text` template exposes the common normalizers as options:

Stemming reduces words to a root form, improving recall by matching different inflections:

Query

CREATE TEXT SEARCH DICTIONARY ta_stem (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = true,
    accent = false
);
SELECT ts_lexize('ta_stem', 'running runners ran');

Result

 ts_lexize
------------------
 {run,runner,ran}

Stop words drop high-frequency words that carry little meaning (the list is comma-separated and quoted):

Query

CREATE TEXT SEARCH DICTIONARY ta_stopwords (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = false,
    accent = false,
    stopwords = '"the", "of"'
);
SELECT ts_lexize('ta_stopwords', 'the speed of light');

Result

 ts_lexize
---------------
 {speed,light}

Accent folding maps accented characters to their ASCII base so `café` matches `cafe`. It is controlled by `accent`: `accent = false` folds accents away, `accent = true` preserves them:

Query

CREATE TEXT SEARCH DICTIONARY ta_accent (
    template = 'text',
    locale = 'en_US.UTF-8',
    case = 'lower',
    stemming = false,
    accent = false
);
SELECT ts_lexize('ta_accent', 'Café');

Result

 ts_lexize
-----------
 {cafe}
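One common way to implement accent folding is Unicode decomposition followed by removal of combining marks. The Python sketch below shows that technique with the standard `unicodedata` module; it is an illustration of the general approach, and it is an assumption that SereneDB's folding behaves similarly. Characters without a decomposition (such as `ø` or `ß`) pass through this sketch unchanged.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Café").lower())  # cafe
```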

Case folding (`case = 'lower'`) is applied in every `text` example above. Dedicated normalizing templates also exist (`stem`, `norm`, `stopwords`, `collation`) for use inside a `pipeline`.

Locale-aware analysis. The `collation` and `norm` templates take an ICU locale, so sorting and equality follow a language's rules rather than raw byte order; German `de`, for example, sorts `ä` next to `a`. A `collation` dictionary turns each value into one locale-ordered key, which is ideal for range queries and exact ordering on a column. The same ICU locales back the SQL `COLLATE` clause.

Composing with `pipeline`

The `pipeline` template chains templates in order. Steps are numbered starting at 1 (`step1_template`, `step2_template`, …). Here a `delimiter` tokenizer splits on commas, then a `norm` step lowercases each token:

Query

CREATE TEXT SEARCH DICTIONARY ta_pipeline (
    template = 'pipeline',
    step1_template = 'delimiter',
    step1_delimiter = ',',
    step2_template = 'norm',
    step2_locale = 'en_US.UTF-8',
    step2_case = 'lower'
);
SELECT ts_lexize('ta_pipeline', 'RED,Green,BLUE');

Result

 ts_lexize
------------------
 {red,green,blue}
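Conceptually, a pipeline is function composition over a token list: each step receives the tokens emitted by the previous one. The Python sketch below models the two steps used above. The stage constructors `delimiter` and `lowercase` and the `pipeline` helper are hypothetical names for this illustration, not SereneDB APIs.

```python
from functools import reduce


def delimiter(sep):
    """Stage that splits every incoming token on sep."""
    return lambda tokens: [part for tok in tokens for part in tok.split(sep)]


def lowercase():
    """Stage that lowercases every token."""
    return lambda tokens: [tok.lower() for tok in tokens]


def pipeline(*stages):
    """Chain stages so each one consumes the previous stage's tokens."""
    return lambda text: reduce(lambda toks, stage: stage(toks), stages, [text])


ta_pipeline = pipeline(delimiter(","), lowercase())
print(ta_pipeline("RED,Green,BLUE"))  # ['red', 'green', 'blue']
```

Because the input starts as a single token holding the whole string, a tokenizing stage normally comes first and normalizing stages follow it.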

Token positions and feature flags

By default the index records only which terms appear in which rows. Some query and ranking features need extra per-token information, enabled with feature flags on the dictionary (or per-column in the index):

| Flag | Records | Needed for |
| --- | --- | --- |
| `frequency` | how often each term occurs | relevance scoring |
| `position` | each token's ordinal position | phrase and proximity queries |
| `offset` | each token's character offsets | highlighting |
| `norm` | a length-normalization factor | some scorers |

The flags have dependencies: `position` and `norm` require `frequency`, and `offset` requires `frequency` and `position`. Positions are what let phrase search distinguish `quick brown fox` from `fox brown quick`: the tokens are the same, but their positions differ.
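The role of positions in phrase search can be sketched in a few lines of Python. This is a toy positional index, assuming whitespace tokenization and lowercasing; it illustrates the concept and makes no claim about how SereneDB stores or evaluates positions.

```python
from collections import defaultdict


def index_positions(text):
    """Map each lowercased token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        positions[token].append(pos)
    return positions


def phrase_match(text, phrase):
    """True when the phrase tokens occur at consecutive positions in text."""
    positions = index_positions(text)
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


print(phrase_match("the quick brown fox", "quick brown fox"))  # True
print(phrase_match("fox brown quick", "quick brown fox"))      # False
```

Both texts contain all three terms, so a term-only index cannot tell them apart; only the recorded positions make the second call return `False`.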

Enable only the flags your queries need — each one enlarges the index.
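When planning which flags to enable, the dependencies stated above can be expanded mechanically. The Python sketch below encodes them as a table and computes the full set implied by the flags a query needs. The text does not say whether SereneDB enables dependencies automatically or rejects an incomplete set, so treat this purely as a planning aid; `REQUIRES` and `required_flags` are hypothetical names.

```python
# Dependencies as stated in the text: each flag maps to the flags it requires.
REQUIRES = {
    "frequency": set(),
    "position": {"frequency"},
    "norm": {"frequency"},
    "offset": {"frequency", "position"},
}


def required_flags(wanted):
    """Return the wanted flags plus everything they depend on."""
    resolved = set()
    pending = list(wanted)
    while pending:
        flag = pending.pop()
        if flag not in resolved:
            resolved.add(flag)
            pending.extend(REQUIRES[flag])
    return resolved


print(sorted(required_flags({"offset"})))  # ['frequency', 'offset', 'position']
print(sorted(required_flags({"norm"})))    # ['frequency', 'norm']
```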
