wordnet_synonyms

The wordnet_synonyms template expands tokens using a WordNet Prolog synonyms database supplied inline via the required SYNONYMS option. Where solr_synonyms rewrites a word to its sibling words, this template rewrites each word to the id of every synset it belongs to. A synset id is a numeric concept identifier shared by all words that have the same sense.

Each record has the form s(synset_id, w_num, 'word', ss_type, sense_number, tag_count). (the trailing period is part of the record) and assigns one word to one synset. Words that appear under the same synset_id are synonyms, so they all map to that id and meet in the index even though the surface words differ. A word that appears in several synsets maps to all of their ids. A word that appears in no record produces no tokens.
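To make the record format concrete, here is a minimal Python sketch of how such a database could be read into a word-to-synset lookup. It is an illustration only, not the template's implementation: the names RECORD, build_synset_map and lookup are hypothetical, the adjective record with id 300000003 is invented for the example, and case folding is not modeled. The doubled-quote handling follows the WordNet Prolog convention for apostrophes and is assumed, not confirmed, for this template.

```python
import re
from collections import defaultdict

# One record: s(synset_id, w_num, 'word', ss_type, sense_number, tag_count).
# An apostrophe inside a word is assumed to be written as two single quotes.
RECORD = re.compile(
    r"s\((\d+),\s*(\d+),\s*'((?:[^']|'')*)',\s*([a-z]),\s*(\d+),\s*(\d+)\)\."
)

def build_synset_map(database: str) -> dict[str, list[int]]:
    """Map each word to the sorted list of synset ids it appears under."""
    synsets = defaultdict(set)
    for line in database.splitlines():
        match = RECORD.fullmatch(line.strip())
        if match is None:
            continue  # this sketch silently skips blank or malformed lines
        synset_id = int(match.group(1))
        word = match.group(3).replace("''", "'")
        synsets[word].add(synset_id)
    return {word: sorted(ids) for word, ids in synsets.items()}

def lookup(synset_map: dict[str, list[int]], word: str) -> list[int]:
    """Return the synset ids for a word, or an empty list if it has no record."""
    return synset_map.get(word, [])

# Hypothetical database: 'fast' appears in two synsets, 'keyboard' in none.
DATABASE = """\
s(100000001,1,'fast',v,1,0).
s(100000001,2,'quick',v,1,0).
s(100000002,1,'car',n,1,0).
s(300000003,1,'fast',a,1,0).
"""

synset_map = build_synset_map(DATABASE)
print(lookup(synset_map, "fast"))      # [100000001, 300000003]
print(lookup(synset_map, "quick"))     # [100000001]
print(lookup(synset_map, "keyboard"))  # []
```

The three lookups correspond to the three cases described above: a word in several synsets, a word in one synset, and a word with no record.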

Like solr_synonyms, it is typically used inside a pipeline to broaden recall to related words.

Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| SYNONYMS | string | required | Inline WordNet Prolog database: one s(...) record per line |

Tokenization

Given records that place fast, quick and swift under synset 100000001, each of those words is rewritten to {100000001}. Because the indexed text and the query are analyzed the same way, a search for quick reduces to 100000001 and so matches a document that contained fast. Words placed under a different synset map to that synset's id, and a word the database never mentions yields an empty token set.
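The matching behaviour described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not how the index works internally: SYNSETS, analyze and matches are hypothetical names, text is split on whitespace only, and a query is treated as matching when it shares at least one token with the document.

```python
# Hypothetical word -> synset-id table mirroring the records described above.
SYNSETS = {
    "fast": {"100000001"},
    "quick": {"100000001"},
    "swift": {"100000001"},
}

def analyze(text: str) -> set[str]:
    """Replace every word with its synset ids; unknown words contribute nothing."""
    tokens: set[str] = set()
    for word in text.split():
        tokens |= SYNSETS.get(word, set())
    return tokens

def matches(document: str, query: str) -> bool:
    """Document and query go through the same analysis, then tokens are compared."""
    return bool(analyze(query) & analyze(document))

print(matches("a fast reply", "quick"))     # True: both reduce to 100000001
print(matches("a fast reply", "keyboard"))  # False: the query has no tokens
```

Note that words without a record ("a", "reply") vanish from the token set entirely, so a pipeline that relies on this template alone cannot match them.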

| Input | Records | Tokens |
|-------|---------|--------|
| fast | s(100000001,1,'fast',v,1,0). | {100000001} |
| quick | s(100000001,2,'quick',v,1,0). | {100000001} |
| keyboard | (no record) | {} |

The database below defines two synsets, a verb sense (fast, quick, swift) and a noun sense (car, automobile):

Query
CREATE TEXT SEARCH DICTIONARY wordnet_syn (
    template = 'wordnet_synonyms',
    -- words sharing a synset id are synonyms; one s(...) record per line
    synonyms = 's(100000001,1,''fast'',v,1,0).
s(100000001,2,''quick'',v,1,0).
s(100000001,3,''swift'',v,1,0).
s(100000002,1,''car'',n,1,0).
s(100000002,2,''automobile'',n,1,0).'
);

Words sharing a synset map to its id, so synonyms meet under the same token:

Query
SELECT ts_lexize('wordnet_syn', 'fast');
Result
 ts_lexize
-------------
 {100000001}
Query
SELECT ts_lexize('wordnet_syn', 'quick');
Result
 ts_lexize
-------------
 {100000001}

A word the database never mentions produces no tokens:

Query
SELECT ts_lexize('wordnet_syn', 'keyboard');
Result
 ts_lexize
-----------
 {}
