wordnet_synonyms
The wordnet_synonyms template expands tokens using a WordNet Prolog synonyms database supplied inline via the required SYNONYMS option. Where solr_synonyms rewrites a word to its sibling words, this template rewrites each word to the synset id(s) it belongs to — a numeric concept identifier shared by all words of the same sense.
Each record assigns one word to one synset and is written as s(synset_id, w_num, 'word', ss_type, sense_number, tag_count). with a terminating period. Words that appear under the same synset_id are synonyms, so they all map to that id and meet in the index even though the surface words differ. A word that appears in several synsets maps to all of their ids. A word in no record produces no tokens.
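The mapping rule can be modelled in a few lines of Python. This is only a sketch of the documented behavior, not the template's implementation: the regular expression and the names parse_records and lookup are assumptions made for illustration, the synset id 100000003 is hypothetical, and words are assumed to contain no embedded quotes.

```python
import re
from collections import defaultdict

# Assumed record shape: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
# Only synset_id and word are captured; the other fields are skipped.
RECORD = re.compile(r"s\((\d+),\s*\d+,\s*'([^']*)',\s*\w,\s*\d+,\s*\d+\)\.")

# Hypothetical database: 'fast' appears under two synsets.
DATABASE = """\
s(100000001,1,'fast',v,1,0).
s(100000001,2,'quick',v,1,0).
s(100000003,1,'fast',a,1,0).
"""


def parse_records(database):
    """Map each word to the set of synset ids it appears under."""
    synsets = defaultdict(set)
    for synset_id, word in RECORD.findall(database):
        synsets[word].add(int(synset_id))
    return dict(synsets)


def lookup(synsets, word):
    """Return the sorted synset ids for a word, or an empty list if unknown."""
    return sorted(synsets.get(word, set()))


if __name__ == "__main__":
    table = parse_records(DATABASE)
    for word in ("fast", "quick", "keyboard"):
        print(word, lookup(table, word))
```

In this model, fast resolves to both ids, quick to one, and keyboard to none, mirroring the three cases described above.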
Like solr_synonyms, it is typically used inside a pipeline to broaden recall to related words.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| SYNONYMS | string | required | Inline WordNet Prolog database: one s(...) record per line |
Tokenization
Given records that place fast, quick and swift under synset 100000001, each of those words is rewritten to {100000001}. Because the indexed text and the query are analyzed the same way, a search for quick reduces to 100000001 and so matches a document that contained fast. Words placed under a different synset map to that synset's id, and a word the database never mentions yields an empty token set.
| Input | Records | Tokens |
|---|---|---|
| fast | s(100000001,1,'fast',v,1,0). | {100000001} |
| quick | s(100000001,2,'quick',v,1,0). | {100000001} |
| keyboard | (no record) | {} |
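The reason a query for quick finds a document containing fast can be sketched in Python. This is a simplified model under stated assumptions, not how the index is implemented: the SYNSETS table, the analyze and matches helpers, lowercasing, whitespace splitting, and the "share at least one id" matching rule are all illustrative choices.

```python
# Hypothetical in-memory form of the records: word -> synset ids.
SYNSETS = {
    "fast": {100000001},
    "quick": {100000001},
    "swift": {100000001},
}


def analyze(text):
    """Rewrite each whitespace-separated word to its synset ids.

    Words with no record contribute nothing, so they vanish from the result.
    """
    tokens = set()
    for word in text.lower().split():
        tokens |= SYNSETS.get(word, set())
    return tokens


def matches(document, query):
    """Simplified rule: match when document and query share a synset id."""
    return bool(analyze(query) & analyze(document))


if __name__ == "__main__":
    print(matches("a fast reply", "quick"))
    print(matches("a mechanical keyboard", "quick"))
```

Because both sides pass through the same analyze step, the comparison happens on synset ids rather than on surface words.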
The database below defines two synsets — a verb sense and a noun sense:

    CREATE TEXT SEARCH DICTIONARY wordnet_syn (
        template = 'wordnet_synonyms',
        -- words sharing a synset id are synonyms; one s(...) record per line
        synonyms = 's(100000001,1,''fast'',v,1,0).
    s(100000001,2,''quick'',v,1,0).
    s(100000001,3,''swift'',v,1,0).
    s(100000002,1,''car'',n,1,0).
    s(100000002,2,''automobile'',n,1,0).'
    );

Words sharing a synset map to its id, so synonyms meet under the same token:

    SELECT ts_lexize('wordnet_syn', 'fast');
      ts_lexize
    -------------
     {100000001}

    SELECT ts_lexize('wordnet_syn', 'quick');
      ts_lexize
    -------------
     {100000001}

A word the database never mentions produces no tokens:

    SELECT ts_lexize('wordnet_syn', 'keyboard');
     ts_lexize
    -----------
     {}

See also
- solr_synonyms: Solr-format synonyms
- pipeline: chain a tokenizer ahead of the synonym filter
- CREATE TEXT SEARCH DICTIONARY