stopwords

The stopwords template removes the words listed in STOPWORDS from the token stream rather than producing tokens of its own. Dropping very common words (the, a, is) shrinks the index and keeps high-frequency terms from dominating relevance scores.

Because it is a filter, it operates on the output of an earlier tokenizer, so it is used as a stage inside a pipeline, after a template such as text or segmentation. Set HEX = true when the stop words are supplied as hex-encoded byte strings. The text template can already filter stop words through its own STOPWORDS option; use this template to apply the same filtering within a custom pipeline.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| STOPWORDS | string list | | Stop words (e.g., '"the","a","an"') |
| HEX | boolean | false | Treat stop words as hex-encoded strings |

Tokenization

stopwords compares each token it receives against the list and drops the ones that match, passing everything else through unchanged. A token that is itself a stop word is removed, leaving no output; a token that is not in the list survives.

| Input | STOPWORDS | Output tokens |
| --- | --- | --- |
| the | "the","a","an","is" | (empty, removed) |
| cat | "the","a","an","is" | cat |

Query

CREATE TEXT SEARCH DICTIONARY stop_filter (
    template = 'stopwords',
    stopwords = '"the","a","an","is"'
);
SELECT ts_lexize('stop_filter', 'the');

Result

 ts_lexize
-----------
 {}
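
The matching rule described above can be modeled in a few lines of Python. This is only a conceptual sketch, not the template's actual implementation: the function name `filter_stopwords` is hypothetical, and exact, case-sensitive comparison is an assumption, since the behavior for differing case is not stated here.

```python
def filter_stopwords(tokens, stopwords):
    """Drop tokens found in `stopwords`; pass all others through unchanged.

    Assumption: matching is exact and case-sensitive.
    """
    stop = set(stopwords)  # set membership keeps each lookup O(1)
    return [token for token in tokens if token not in stop]


STOPWORDS = ["the", "a", "an", "is"]

print(filter_stopwords(["the"], STOPWORDS))  # []
print(filter_stopwords(["cat"], STOPWORDS))  # ['cat']
```

The two calls mirror the two rows of the table: a token that is a stop word yields no output, and any other token is returned as-is.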

Filtering inside a pipeline

In practice stopwords follows a tokenizer. A pipeline that splits on spaces and then filters drops the common words from a phrase while keeping the rest:

| Input | Pipeline | Output tokens |
| --- | --- | --- |
| the cat is a animal | delimiter (space) → stopwords | cat, animal |

Query

CREATE TEXT SEARCH DICTIONARY stop_pipeline (
    template = 'pipeline',
    step1_template = 'delimiter',
    step1_delimiter = ' ',
    step2_template = 'stopwords',
    step2_stopwords = '"the","a","an","is"'
);
SELECT ts_lexize('stop_pipeline', 'the cat is a animal');

Result

 ts_lexize
--------------
 {cat,animal}
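
To make the data flow between the two stages explicit, the sketch below imitates the pipeline in plain Python. The names `delimiter_stage`, `stopwords_stage`, and `run_pipeline` are hypothetical, and skipping empty tokens produced by repeated delimiters is an assumption rather than documented behavior.

```python
def delimiter_stage(text, delimiter=" "):
    """Step 1: split the input on the delimiter.

    Assumption: empty tokens from repeated delimiters are discarded.
    """
    return [token for token in text.split(delimiter) if token]


def stopwords_stage(tokens, stopwords):
    """Step 2: remove any token that appears in the stop-word list."""
    stop = set(stopwords)
    return [token for token in tokens if token not in stop]


def run_pipeline(text):
    """Chain the stages: each stage consumes the previous stage's tokens."""
    tokens = delimiter_stage(text, " ")
    return stopwords_stage(tokens, ["the", "a", "an", "is"])


print(run_pipeline("the cat is a animal"))  # ['cat', 'animal']
```

The order matters: the filter sees individual tokens only because the delimiter stage ran first. Given the whole phrase as a single token, it would match nothing.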

Hex-encoded stopwords

With HEX = true the stop words are decoded from hex before matching, so 616263 filters the token abc:

Query

CREATE TEXT SEARCH DICTIONARY hex_stop (
    template = 'stopwords',
    stopwords = '"616263","6D6E6F"',
    HEX = true
);
SELECT ts_lexize('hex_stop', 'abc');

Result

 ts_lexize
-----------
 {}
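
The decoding step can be checked by hand with a short Python sketch. It assumes the decoded bytes are interpreted as UTF-8, which is not stated here, and `decode_hex_stopwords` is a hypothetical helper name used only for illustration.

```python
def decode_hex_stopwords(hex_words):
    """Decode hex-encoded stop words into plain strings.

    Assumption: the decoded bytes are UTF-8 text.
    """
    return {bytes.fromhex(word).decode("utf-8") for word in hex_words}


stop = decode_hex_stopwords(["616263", "6D6E6F"])
print(sorted(stop))   # ['abc', 'mno']
print("abc" in stop)  # True, so the token abc would be filtered
```

Each pair of hex digits is one byte: 61, 62, 63 are the ASCII codes for a, b, c, and 6D, 6E, 6F are m, n, o.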

See also