stopwords
The stopwords template removes the words listed in STOPWORDS from the token stream rather than producing tokens of its own. Dropping very common words (the, a, is) shrinks the index and keeps high-frequency terms from dominating relevance scores.
Because it is a filter, it operates on the output of an earlier tokenizer, so it is used as a stage inside a pipeline, after a template such as text or segmentation. The text template can already filter stop words through its own STOPWORDS option; this template applies the same filtering within a custom pipeline. Set HEX = true when the stop words are supplied as hex-encoded byte strings.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| STOPWORDS | string list | — | Stop words (e.g., '"the","a","an"') |
| HEX | boolean | false | Treat stop words as hex-encoded strings |
Tokenization
stopwords compares each token it receives against the list and drops the ones that match, passing everything else through unchanged. A token that is itself a stop word is removed, leaving no output; a token that is not in the list survives.
| Input | STOPWORDS | Output tokens |
|---|---|---|
| the | "the","a","an","is" | (empty: removed) |
| cat | "the","a","an","is" | cat |
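
The matching rule in the table can be modeled in a few lines of Python. This is only an illustrative sketch, not the template's implementation: the names `parse_stopwords` and `stop_filter` are hypothetical, and it assumes exact, case-sensitive matching and CSV-style quoting of the STOPWORDS value, neither of which is confirmed above.

```python
import csv


def parse_stopwords(option: str) -> frozenset:
    """Parse a value such as '"the","a","an"' into a set of words.

    Assumption: the list uses CSV-style quoting.
    """
    return frozenset(next(csv.reader([option])))


def stop_filter(tokens, stopwords):
    """Drop tokens that are in the stop list; pass the rest through in order.

    Assumption: matching is exact and case-sensitive.
    """
    return [token for token in tokens if token not in stopwords]


STOPWORDS = parse_stopwords('"the","a","an","is"')

print(stop_filter(["the"], STOPWORDS))  # []
print(stop_filter(["cat"], STOPWORDS))  # ['cat']
```

The two calls mirror the two table rows: a stop word yields an empty result, and any other token is returned unchanged.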

    CREATE TEXT SEARCH DICTIONARY stop_filter (
        template = 'stopwords',
        stopwords = '"the","a","an","is"'
    );

    SELECT ts_lexize('stop_filter', 'the');
     ts_lexize
    -----------
     {}

Filtering inside a pipeline
In practice stopwords follows a tokenizer. A pipeline that splits on spaces and then filters drops the common words from a phrase while keeping the rest:
| Input | Pipeline | Output tokens |
|---|---|---|
| the cat is a animal | delimiter (space) → stopwords | cat, animal |
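
The two-stage flow in the table can be sketched as a tokenizer followed by a chain of filters. The Python below is a rough model under stated assumptions: `split_on`, `drop_stopwords` and `run_pipeline` are hypothetical names, and discarding empty pieces between repeated delimiters is an assumption rather than documented behavior of the delimiter template.

```python
from functools import partial


def split_on(text, delimiter=" "):
    """Tokenizer stage: split on the delimiter.

    Assumption: empty pieces (from repeated delimiters) are discarded.
    """
    return [piece for piece in text.split(delimiter) if piece]


def drop_stopwords(tokens, stopwords):
    """Filter stage: remove tokens that appear in the stop list."""
    return [token for token in tokens if token not in stopwords]


def run_pipeline(text, tokenizer, *filters):
    """Run the tokenizer first, then each filter on the previous stage's output."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


stop = partial(drop_stopwords, stopwords={"the", "a", "an", "is"})

print(run_pipeline("the cat is a animal", split_on, stop))  # ['cat', 'animal']
```

The filter never sees the raw phrase, only the tokens the first stage produced, which is why stopwords has to come after a tokenizer.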

    CREATE TEXT SEARCH DICTIONARY stop_pipeline (
        template = 'pipeline',
        step1_template = 'delimiter',
        step1_delimiter = ' ',
        step2_template = 'stopwords',
        step2_stopwords = '"the","a","an","is"'
    );

    SELECT ts_lexize('stop_pipeline', 'the cat is a animal');
      ts_lexize
    --------------
     {cat,animal}

Hex-encoded stopwords
With HEX = true the stop words are decoded from hex before matching, so 616263 filters the token abc:
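
To see why 616263 corresponds to abc, the decoding step can be reproduced in Python. This is a sketch of the idea only: `decode_hex_stopwords` is a hypothetical helper, and interpreting the decoded bytes as UTF-8 text is an assumption (the template may compare raw bytes instead).

```python
def decode_hex_stopwords(hex_words, encoding="utf-8"):
    """Decode each hex string into the word it represents.

    Assumption: the decoded bytes are interpreted as UTF-8 text.
    """
    return {bytes.fromhex(word).decode(encoding) for word in hex_words}


decoded = decode_hex_stopwords(["616263", "6D6E6F"])
print(sorted(decoded))  # ['abc', 'mno']

# Matching then works on the decoded words, as with a plain stop list.
tokens = ["abc", "xyz", "mno"]
print([token for token in tokens if token not in decoded])  # ['xyz']
```

Each pair of hex digits is one byte, so 61, 62 and 63 are the bytes for a, b and c.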

    CREATE TEXT SEARCH DICTIONARY hex_stop (
        template = 'stopwords',
        stopwords = '"616263","6D6E6F"',
        HEX = true
    );

    SELECT ts_lexize('hex_stop', 'abc');
     ts_lexize
    -----------
     {}

See also
- text: tokenizer with a built-in STOPWORDS option
- pipeline: chain a tokenizer before stopwords
- CREATE TEXT SEARCH DICTIONARY