
nearest_neighbors

The nearest_neighbors template uses a pre-trained embedding model to expand each input token with the terms whose vectors lie closest to it. In effect it enriches the text with semantically related words, so a document indexed through this template can be found by synonyms and near-synonyms it never literally contained. This makes it a recall-oriented complement to exact full-text matching.

How it works

For every token in the input, the analyzer asks a fastText embedding model loaded from modellocation for its topk nearest neighbors and emits those neighbor words as additional terms. Applied at index time, it broadens what a document can match; applied to the query, it broadens what the query reaches. For example, a cooking model might expand "cake" into related terms such as "three-tiered" and "wham".
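The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the template's implementation: the two-dimensional vectors are made up, the names TOY_VECTORS, nearest_neighbors and expand are hypothetical, and cosine similarity is assumed as the closeness measure. A real fastText model has far more dimensions and can also build vectors for unseen words from subwords, which this toy version does not attempt.

```python
import math

# Made-up 2-D vectors standing in for a trained embedding model.
TOY_VECTORS = {
    "cake": (0.9, 0.1),
    "pastry": (0.85, 0.2),
    "frosting": (0.8, 0.3),
    "salt": (0.1, 0.9),
    "pepper": (0.15, 0.95),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest_neighbors(token, vectors, topk=1):
    """Return the topk words closest to token, excluding the token itself."""
    if token not in vectors:
        return []  # the toy model has no answer for unknown words
    query = vectors[token]
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word != token
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:topk]]


def expand(tokens, vectors, topk=1):
    """Emit each token followed by its nearest neighbors."""
    terms = []
    for token in tokens:
        terms.append(token)
        terms.extend(nearest_neighbors(token, vectors, topk))
    return terms


print(expand(["cake"], TOY_VECTORS, topk=2))
```

With these toy vectors, "cake" picks up "pastry" and "frosting" because their vectors point in nearly the same direction, while "salt" and "pepper" are ranked far lower. Raising topk simply takes more entries from the same ranked list.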

The model file is required and must be reachable from the server process at the path given in modellocation; the dictionary cannot be created without a loadable model. Where classification tags a document with predicted category labels, nearest_neighbors instead grows its vocabulary with related terms.
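Because the model file has to be readable by the server process, a quick pre-flight check can save a failed dictionary creation. The following is a minimal sketch using only the Python standard library; model_is_readable is a hypothetical helper, and the result is only meaningful when it runs on the same host and under the same user as the server.

```python
import os


def model_is_readable(path):
    """True if path is an existing regular file the current process may read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


# The path below is the one from the usage example; adjust it to your setup.
print(model_is_readable("/models/cooking.bin"))
```

The check says nothing about whether the file is a valid fastText model; it only rules out the most common cause, a wrong or unreadable path.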

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| modellocation | string | required | Path to the fastText model file, reachable from the server |
| topk | integer | 1 | Number of nearest neighbors to emit per input token |

Usage

Point modellocation at a trained fastText model and choose how many neighbors to add per token:

CREATE TEXT SEARCH DICTIONARY nn_dict (
    template = 'nearest_neighbors',
    MODELLOCATION = '/models/cooking.bin',
    TOPK = 2
);

Attached to a text column in a USING inverted index, the dictionary indexes each document under both its own tokens and their nearest neighbors, widening recall. The emitted neighbors depend on the model; the example above uses a small cooking model.
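The effect on recall can be illustrated with a toy inverted index in Python. This is a conceptual sketch, not how the server builds its index: the NEIGHBORS table is a made-up stand-in for model lookups, and index_terms and build_index are hypothetical helper names.

```python
from collections import defaultdict

# Made-up neighbor table standing in for lookups against a real model.
NEIGHBORS = {"cake": ["pastry"], "salt": ["pepper"]}


def index_terms(text, neighbors):
    """Terms a document is indexed under: its own tokens plus their neighbors."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.update(neighbors.get(token, []))
    return terms


def build_index(docs, neighbors):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text, neighbors):
            index[term].add(doc_id)
    return dict(index)


docs = {1: "chocolate cake recipe", 2: "add salt to taste"}

expanded = build_index(docs, NEIGHBORS)  # tokens plus nearest neighbors
exact_only = build_index(docs, {})       # plain full-text terms only

print(expanded.get("pastry", set()))    # document 1 is found by a related word
print(exact_only.get("pastry", set()))  # no match without expansion
```

Document 1 never contains "pastry", yet the expanded index returns it for that term, while the exact-only index returns nothing. The trade-off is a larger index and, with a high topk, more loosely related matches.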

See also