stopwords

The stopwords template removes the words listed in STOPWORDS from the token stream rather than producing tokens of its own. Dropping very common words (the, a, is) shrinks the index and keeps high-frequency terms from dominating relevance scores.

Because it is a filter, it operates on the output of an earlier tokenizer and is therefore used as a stage inside a pipeline, after a template such as text or segmentation. Set HEX = true when the stop words are supplied as hex-encoded byte strings.

Note that text can filter stop words on its own through its STOPWORDS option; use this template when you need the same filtering within a custom pipeline.

Options

Option	Type	Default	Description
`STOPWORDS`	string list	—	Stop words (e.g., `'"the","a","an"'`)
`HEX`	boolean	`false`	Treat stop words as hex-encoded strings

Tokenization

stopwords compares each token it receives against the list and drops the ones that match, passing everything else through unchanged. A token that is itself a stop word is removed, leaving no output; a token that is not in the list survives.

Input	STOPWORDS	Output tokens
`the`	`"the","a","an","is"`	(empty — removed)
`cat`	`"the","a","an","is"`	`cat`

Query

CREATE TEXT SEARCH DICTIONARY stop_filter (
    template = 'stopwords',
    stopwords = '"the","a","an","is"'
);
SELECT ts_lexize('stop_filter', 'the');

Result

 ts_lexize
-----------
 {}
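
The matching rule described above can be modeled in a few lines of Python. This is only a sketch of the behavior, not the template's implementation: the function name `filter_stopwords` is hypothetical, and exact, case-sensitive comparison is an assumption, since this page does not say how case is handled.

```python
def filter_stopwords(tokens, stopwords):
    """Drop tokens found in the stop word list; pass the rest through unchanged.

    Assumption: matching is exact and case-sensitive.
    """
    stop_set = set(stopwords)  # a set gives constant-time membership checks
    return [token for token in tokens if token not in stop_set]


STOPWORDS = ["the", "a", "an", "is"]

print(filter_stopwords(["the"], STOPWORDS))  # [] (the token is a stop word)
print(filter_stopwords(["cat"], STOPWORDS))  # ['cat'] (the token survives)
```

The two calls mirror the rows of the table: a stop word yields an empty result, and any other token is returned as it came in.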

Filtering inside a pipeline

In practice stopwords follows a tokenizer. A pipeline that splits on spaces and then filters drops the common words from a phrase while keeping the rest:

Input	Pipeline	Output tokens
`the cat is a animal`	`delimiter` (space) → `stopwords`	`cat`, `animal`

Query

CREATE TEXT SEARCH DICTIONARY stop_pipeline (
    template = 'pipeline',
    step1_template = 'delimiter',
    step1_delimiter = ' ',
    step2_template = 'stopwords',
    step2_stopwords = '"the","a","an","is"'
);
SELECT ts_lexize('stop_pipeline', 'the cat is a animal');

Result

 ts_lexize
--------------
 {cat,animal}
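
To make the two-stage flow concrete, here is a rough Python model of a delimiter step feeding a stop word filter. The names `delimiter_stage`, `stopwords_stage`, and `run_pipeline` are hypothetical, and dropping empty pieces produced by repeated delimiters is an assumption rather than documented behavior of the delimiter template.

```python
STOPWORDS = ["the", "a", "an", "is"]


def delimiter_stage(text, delimiter=" "):
    """Stage 1: split the input into tokens on the delimiter.

    Assumption: empty pieces (from repeated delimiters) are discarded.
    """
    return [piece for piece in text.split(delimiter) if piece]


def stopwords_stage(tokens, stopwords):
    """Stage 2: remove tokens that match the stop word list."""
    stop_set = set(stopwords)
    return [token for token in tokens if token not in stop_set]


def run_pipeline(text):
    """Run the stages in order: the filter only sees what the tokenizer emits."""
    tokens = delimiter_stage(text, " ")
    return stopwords_stage(tokens, STOPWORDS)


print(run_pipeline("the cat is a animal"))  # ['cat', 'animal']
```

The order matters: the filter has nothing to compare until an earlier stage has produced individual tokens, which is why stopwords is placed second.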

Hex-encoded stopwords

With HEX = true the stop words are decoded from hex before matching, so the entry 616263 decodes to abc and filters the token abc:

Query

CREATE TEXT SEARCH DICTIONARY hex_stop (
    template = 'stopwords',
    stopwords = '"616263","6D6E6F"',
    HEX = true
);
SELECT ts_lexize('hex_stop', 'abc');

Result

 ts_lexize
-----------
 {}
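
When preparing a hex-encoded list, it can help to check what each entry decodes to. The Python sketch below does the conversion in both directions. The helper names are hypothetical, and UTF-8 is an assumption: this page says only that the values are hex-encoded byte strings, without naming an encoding.

```python
def decode_hex_stopwords(hex_words, encoding="utf-8"):
    """Turn hex-encoded stop words back into text (assumes UTF-8 bytes)."""
    return [bytes.fromhex(word).decode(encoding) for word in hex_words]


def encode_stopword(word, encoding="utf-8"):
    """Produce the uppercase hex form of a stop word for use in STOPWORDS."""
    return word.encode(encoding).hex().upper()


print(decode_hex_stopwords(["616263", "6D6E6F"]))  # ['abc', 'mno']
print(encode_stopword("abc"))                      # 616263
print(encode_stopword("mno"))                      # 6D6E6F
```

Decoding the list from the query above shows that it filters the tokens abc and mno.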
