Text Analysis
Text analysis turns the free-form text of a column into the sequence of tokens that an inverted index actually stores and searches. Getting analysis right is what lets a search for "running shoes" find a document that says "Run faster in our Shoes!": the surface forms differ, but after lowercasing and stemming the query terms and the matching words in the document analyze to the same tokens.
Analysis in SereneDB is configured entirely through a text search dictionary attached to a column. A dictionary is assembled from templates: some templates tokenize (split text into tokens), others normalize (lowercase, fold accents, stem, drop stop words), and the pipeline template composes them. There is no separate "token filter" object — every stage is a template.
The same analysis at index time and query time
The single most important rule: a column's dictionary is applied both when the column is indexed and when a query runs against it. The data and the query pass through the identical pipeline, so their tokens line up.
Without analysis, a literal comparison of the query FOX against the stored text Quick BROWN Fox would not match. After analysis both sides reduce to the token fox, and the lookup succeeds. You can preview exactly how a dictionary tokenizes any string with ts_lexize — use it whenever you are tuning a dictionary:
CREATE TEXT SEARCH DICTIONARY ta_text ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = false, accent = false);
SELECT ts_lexize('ta_text', 'Quick BROWN Fox');

 ts_lexize
-------------------
 {quick,brown,fox}

Symmetry is the default, not a hard rule. To analyze a query string with a different dictionary than the column's (the counterpart of Elasticsearch's search_analyzer), wrap it in ts_tokenize(text, 'dict') or the 'text'::tokenize('dict') cast. A common use is forcing exact matching against an otherwise-stemmed column with '…'::tokenize('keyword').
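The symmetry rule and the query-time override can be previewed side by side. This is a sketch, not a definitive recipe: it assumes the ta_text dictionary defined above exists and that a dictionary named keyword is available, as the text implies. Whether the two tokenize forms can be evaluated in a bare SELECT, rather than only inside a search predicate, is an assumption.

```sql
-- Index-time side: the stored text goes through ta_text.
SELECT ts_lexize('ta_text', 'Quick BROWN Fox');

-- Query-time side: the query string goes through the same dictionary,
-- so it reduces to the token fox, as described above.
SELECT ts_lexize('ta_text', 'FOX');

-- Query-time override: analyze the query string with a different dictionary.
-- Both spellings come from the text; evaluating them standalone is assumed.
SELECT ts_tokenize('FOX', 'keyword');
SELECT 'FOX'::tokenize('keyword');
```

With the override, the query side is no longer lowercased, so it would only match values stored in exactly that form.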
Tokenizing templates
The tokenizing template decides how text is split. The most common is text, which splits on word boundaries (shown above). Others target specific needs:
A verbatim column — the keyword template, or simply a column with no dictionary — emits the whole value as a single token, giving exact, case-sensitive matching for ids, codes and categories:
CREATE TEXT SEARCH DICTIONARY ta_keyword (template = 'keyword');
SELECT ts_lexize('ta_keyword', 'Hello World');

 ts_lexize
-----------------
 {"Hello World"}

The ngram template emits overlapping character n-grams, which power substring and fuzzy matching:
CREATE TEXT SEARCH DICTIONARY ta_ngram ( template = 'ngram', mingram = 2, maxgram = 3);
SELECT ts_lexize('ta_ngram', 'cat');

 ts_lexize
-------------
 {ca,cat,at}

Further tokenizing templates (sparse_ngram, delimiter / multi_delimiter, segmentation, pattern, path_hierarchy, wildcard) are listed in the CREATE TEXT SEARCH DICTIONARY reference.
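To see how mingram and maxgram interact, the ngram dictionary can be previewed on a slightly longer input. This sketch assumes the ta_ngram dictionary defined above (mingram = 2, maxgram = 3) already exists. By the definition of character n-grams, 'cats' contains the 2-grams ca, at, ts and the 3-grams cat, ats; the order in which ts_lexize returns them is not assumed here.

```sql
-- Preview the n-grams of a four-character input with the 2..3 window
-- configured on ta_ngram. Expected members: ca, at, ts (2-grams)
-- and cat, ats (3-grams); the output order is not asserted.
SELECT ts_lexize('ta_ngram', 'cats');
```

Each extra character adds one n-gram per window size, which is why a wide mingram/maxgram range grows the index quickly.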
Normalization
Normalization rewrites tokens so that equivalent forms collapse together. The text template exposes the common normalizers as options:
Stemming reduces words to a root form, improving recall by matching different inflections:
CREATE TEXT SEARCH DICTIONARY ta_stem ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = true, accent = false);
SELECT ts_lexize('ta_stem', 'running runners ran');

 ts_lexize
------------------
 {run,runner,ran}

Stop words drop high-frequency words that carry little meaning (the list is comma-separated and quoted):
CREATE TEXT SEARCH DICTIONARY ta_stopwords ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = false, accent = false, stopwords = '"the", "of"');
SELECT ts_lexize('ta_stopwords', 'the speed of light');

 ts_lexize
---------------
 {speed,light}

Accent folding maps accented characters to their ASCII base so café matches cafe. It is controlled by the accent option (accent = false folds accents away, accent = true preserves them):
CREATE TEXT SEARCH DICTIONARY ta_accent ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = false, accent = false);
SELECT ts_lexize('ta_accent', 'Café');

 ts_lexize
-----------
 {cafe}

Case folding (case = 'lower') is applied in every example above. Dedicated normalizing templates (stem, norm, stopwords, collation) also exist for use inside a pipeline.
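The text-template options shown one at a time above can also be set together. The following is a minimal sketch: ta_english is a hypothetical name, and it assumes the options compose the same way they behave individually. The source does not state whether stop words are removed before or after stemming, so no output is asserted; previewing with ts_lexize is the way to find out.

```sql
-- Hypothetical dictionary combining options that are each shown
-- separately above: lowercasing, stemming, accent folding, stop words.
CREATE TEXT SEARCH DICTIONARY ta_english (
    template  = 'text',
    locale    = 'en_US.UTF-8',
    case      = 'lower',
    stemming  = true,
    accent    = false,
    stopwords = '"the", "of"'
);

-- Preview the combined effect before attaching the dictionary to a column.
SELECT ts_lexize('ta_english', 'The speed of Running');
```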
Locale-aware analysis. The collation and norm templates take an ICU locale, so sorting and equality follow a language's rules rather than raw byte order — German de, for example, sorts ä next to a. A collation dictionary turns each value into one locale-ordered key, which is ideal for range queries and exact ordering on a column. The same ICU locales back the SQL COLLATE clause.
Composing with pipeline
The pipeline template chains templates in order. Steps are numbered starting at 1 (step1_template, step2_template, …). Here a delimiter tokenizer splits on commas, then a norm step lowercases each token:
CREATE TEXT SEARCH DICTIONARY ta_pipeline ( template = 'pipeline', step1_template = 'delimiter', step1_delimiter = ',', step2_template = 'norm', step2_locale = 'en_US.UTF-8', step2_case = 'lower');
SELECT ts_lexize('ta_pipeline', 'RED,Green,BLUE');

 ts_lexize
------------------
 {red,green,blue}

Token positions and feature flags
By default the index records only which terms appear in which rows. Some query and ranking features need extra per-token information, enabled with feature flags on the dictionary (or per-column in the index):
| Flag | Records | Needed for |
|---|---|---|
| frequency | how often each term occurs | relevance scoring |
| position | each token's ordinal position | phrase and proximity queries |
| offset | each token's character offsets | highlighting |
| norm | a length-normalization factor | some scorers |
The flags have dependencies: position and norm require frequency, and offset requires frequency and position. Positions are what let phrase search distinguish "quick brown fox" from "fox brown quick": the tokens are the same, but their positions differ.
Enable only the flags your queries need — each one enlarges the index.
See also
- CREATE TEXT SEARCH DICTIONARY: every template and option
- Inverted Index: operator classes and how dictionaries attach to columns
- Full-Text Search · Ranking
- Full-Text Search Functions: ts_lexize and the query functions