
ngram

The ngram template breaks each token into overlapping fixed-length character sequences (n-grams) so searches can match on fragments rather than whole words. With the default MINGRAM of 2 and MAXGRAM of 3, the word search yields the 2-grams se, ea, ar, rc, ch and the 3-grams sea, ear, arc, rch, letting a query find it from a partial or slightly misspelled input. This makes the template a good fit for fuzzy matching, autocomplete and typo-tolerant search.
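To make the typo-tolerance claim concrete, the following Python sketch models the gram sets outside the database. The helper `gram_set` and the misspelled query `seach` are illustrative assumptions, not part of the template's API; the sketch only shows that a misspelling still shares several grams with the indexed word.

```python
def gram_set(word: str, mingram: int = 2, maxgram: int = 3) -> set[str]:
    """Return every contiguous character window of length mingram..maxgram."""
    return {
        word[i:i + n]
        for n in range(mingram, maxgram + 1)
        for i in range(len(word) - n + 1)
    }


indexed = gram_set("search")
query = gram_set("seach")  # hypothetical typo: the "r" is missing

# Grams present in both the indexed word and the misspelled query.
shared = indexed & query
print(sorted(shared))
```

How many shared grams count as a match is a ranking decision made at query time and is not modeled here.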

PRESERVEORIGINAL additionally keeps the whole token alongside its grams, and STARTMARKER/ENDMARKER tag the start and end of the source token so prefixes and suffixes can be distinguished from interior matches. The index grows with the width of the MINGRAM to MAXGRAM range, so keep it as narrow as your matching needs allow.
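As a rough way to see how the range width drives index growth, the sketch below counts the windows emitted for a single token. The function `gram_count` is a hypothetical helper; it counts emitted grams only and assumes nothing about how the index stores or de-duplicates them.

```python
def gram_count(token_len: int, mingram: int, maxgram: int) -> int:
    """Number of windows emitted for one token, before any de-duplication."""
    return sum(
        max(token_len - n + 1, 0)  # a length-n window fits (token_len - n + 1) times
        for n in range(mingram, maxgram + 1)
    )


for lo, hi in [(2, 3), (2, 5), (1, 6)]:
    print(f"MINGRAM={lo} MAXGRAM={hi}: {gram_count(6, lo, hi)} grams for a 6-character token")
```

For a 6-character token such as search, the default range emits 9 grams, while widening it to 1 through 6 emits 21.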

For substring search over code, logs or identifiers, prefer sparse_ngram, which answers the same fragment queries while keeping the index far more compact.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| MINGRAM | integer | 2 | Minimum n-gram length |
| MAXGRAM | integer | 3 | Maximum n-gram length |
| PRESERVEORIGINAL | boolean | false | Emit original token alongside n-grams |
| INPUTTYPE | string | 'utf8' | Input encoding: 'binary', 'utf8' |
| STARTMARKER | string | | Prefix marker at n-gram boundary |
| ENDMARKER | string | | Suffix marker at n-gram boundary |

Tokenization

For each input token the template emits every contiguous character window whose length falls between MINGRAM and MAXGRAM, sliding one character at a time across the whole word. With MINGRAM = 2 and MAXGRAM = 3, search produces every 2- and 3-character window, so a query for any of those fragments finds the word — the basis for fuzzy and typo-tolerant matching. Unlike the edge n-grams of text, these grams are not anchored to the start of the word.

| Input | Options | Tokens |
| --- | --- | --- |
| search | MINGRAM = 2, MAXGRAM = 3 | {se,sea,ea,ear,ar,arc,rc,rch,ch} |
| search | MINGRAM = 2, MAXGRAM = 3, PRESERVEORIGINAL = true | {se,sea,search,ea,ear,ar,arc,rc,rch,ch} |
| cat | MINGRAM = 2, MAXGRAM = 3, STARTMARKER = '^', ENDMARKER = '$' | {^ca,^cat,cat$,at$} |
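The sliding-window behavior described above can be modeled in a few lines of Python. This is a sketch, not the template's implementation: the function name `ngram_tokens` is hypothetical, and the position of the preserved original (right after the grams that start at the first character) is an assumption inferred from the token order shown for search.

```python
def ngram_tokens(word, mingram=2, maxgram=3, preserve_original=False):
    """Model of the gram stream: slide one character at a time, shortest gram first."""
    tokens = []
    for start in range(len(word)):
        for n in range(mingram, maxgram + 1):
            if start + n <= len(word):
                tokens.append(word[start:start + n])
        # Assumption: the original follows the grams anchored at position 0,
        # and is skipped when it is already present as a gram.
        if start == 0 and preserve_original and word not in tokens:
            tokens.append(word)
    return tokens


print(ngram_tokens("search"))
print(ngram_tokens("search", preserve_original=True))
```

Under these assumptions the two calls reproduce the token order of the first two search rows.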

Preview the gram stream with ts_lexize:

Query
CREATE TEXT SEARCH DICTIONARY tok_ngram (
    template = 'ngram',
    mingram = 2,
    maxgram = 3
);
SELECT ts_lexize('tok_ngram', 'search');
Result
 ts_lexize
----------------------------------
 {se,sea,ea,ear,ar,arc,rc,rch,ch}

PRESERVEORIGINAL = true keeps the whole word in the stream alongside its grams, so an exact match still scores:

Query
CREATE TEXT SEARCH DICTIONARY tok_ngram_orig (
    template = 'ngram',
    mingram = 2,
    maxgram = 3,
    preserveoriginal = true
);
SELECT ts_lexize('tok_ngram_orig', 'search');
Result
 ts_lexize
-----------------------------------------
 {se,sea,search,ea,ear,ar,arc,rc,rch,ch}

STARTMARKER and ENDMARKER tag only the boundary grams — those at the start of the word carry the start marker and those at the end carry the end marker — so a prefix or suffix query can be distinguished from an interior match:

Query
CREATE TEXT SEARCH DICTIONARY tok_ngram_mark (
    template = 'ngram',
    mingram = 2,
    maxgram = 3,
    startmarker = '^',
    endmarker = '$'
);
SELECT ts_lexize('tok_ngram_mark', 'cat');
Result
 ts_lexize
---------------------
 {^ca,^cat,cat$,at$}
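The marker rule can be sketched in Python as follows. This is a model under stated assumptions, not the template's code: `marked_ngrams` is a hypothetical name, and the handling of a gram that touches both ends of the word (emitting one start-marked and one end-marked token) is inferred solely from the cat result.

```python
def marked_ngrams(word, mingram=2, maxgram=3, start_marker="", end_marker=""):
    """Model of boundary tagging: only grams touching a word edge carry a marker."""
    tokens = []
    for start in range(len(word)):
        for n in range(mingram, maxgram + 1):
            end = start + n
            if end > len(word):
                break  # longer windows will not fit from this position either
            gram = word[start:end]
            at_start = start == 0 and bool(start_marker)
            at_end = end == len(word) and bool(end_marker)
            if at_start:
                tokens.append(start_marker + gram)
            if at_end:
                tokens.append(gram + end_marker)
            if not at_start and not at_end:
                tokens.append(gram)  # interior gram, left unmarked
    return tokens


print(marked_ngrams("cat", start_marker="^", end_marker="$"))
```

With both markers set, a prefix query can target the `^`-tagged grams and a suffix query the `$`-tagged ones, while interior grams stay searchable without a marker.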

Examples

Query
CREATE TEXT SEARCH DICTIONARY ngram_dict (
    template = 'ngram',
    mingram = 2,
    maxgram = 3
);

Unigrams and bigrams

Query
CREATE TEXT SEARCH DICTIONARY unigram_dict (
    template = 'ngram',
    mingram = 1,
    maxgram = 2
);

See also