ngram

The ngram template breaks each token into overlapping character sequences (n-grams) of between MINGRAM and MAXGRAM characters, so searches can match on fragments rather than whole words. With the default MINGRAM of 2 and MAXGRAM of 3, the word search yields the 2-grams se, ea, ar, rc, ch and the 3-grams sea, ear, arc, rch, letting a query find it from a partial or slightly misspelled input. This makes the template a good fit for fuzzy matching, autocomplete, and typo-tolerant search.

PRESERVEORIGINAL additionally keeps the whole token alongside its grams, and STARTMARKER/ENDMARKER tag the start and end of the source token so prefixes and suffixes can be distinguished from interior matches. The index grows with the width of the MINGRAM–MAXGRAM range, so keep it as narrow as your matching needs allow.
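To make the cost of a wider range concrete, the following Python sketch counts how many grams a single token produces for a given range. It is a back-of-the-envelope model, not the template's implementation: `gram_count` is a hypothetical helper, and the count covers emitted grams per token only, not on-disk index size, which also depends on how the index stores and deduplicates entries.

```python
def gram_count(token_length, mingram, maxgram):
    """Number of sliding windows a token of the given length produces."""
    # A window of `size` characters fits at (token_length - size + 1)
    # start positions, or at none if the token is shorter than the window.
    return sum(
        max(0, token_length - size + 1)
        for size in range(mingram, maxgram + 1)
    )


# A six-character token such as "search" under progressively wider ranges.
for lo, hi in [(2, 3), (2, 4), (2, 5)]:
    print(f"MINGRAM={lo} MAXGRAM={hi}: {gram_count(6, lo, hi)} grams")
```

For a six-character token this prints 9, 12 and 14 grams respectively, so every extra length in the range adds another pass of windows over each token.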

For substring search over code, logs or identifiers, prefer sparse_ngram, which answers the same fragment queries while keeping the index far more compact.

Options

Option	Type	Default	Description
`MINGRAM`	integer	`2`	Minimum n-gram length
`MAXGRAM`	integer	`3`	Maximum n-gram length
`PRESERVEORIGINAL`	boolean	`false`	Emit original token alongside n-grams
`INPUTTYPE`	string	`'utf8'`	Input encoding: `'binary'`, `'utf8'`
`STARTMARKER`	string	(none)	Marker prepended to grams that begin at the start of the token
`ENDMARKER`	string	(none)	Marker appended to grams that end at the end of the token

Tokenization

For each input token the template emits every contiguous character window whose length falls between MINGRAM and MAXGRAM, sliding one character at a time across the whole word. With MINGRAM = 2 and MAXGRAM = 3, search produces every 2- and 3-character window, so a query for any of those fragments finds the word — the basis for fuzzy and typo-tolerant matching. Unlike the edge n-grams of text, these grams are not anchored to the start of the word.

Input	Options	Tokens
`search`	`MINGRAM = 2`, `MAXGRAM = 3`	`{se,sea,ea,ear,ar,arc,rc,rch,ch}`
`search`	`MINGRAM = 2`, `MAXGRAM = 3`, `PRESERVEORIGINAL = true`	`{se,sea,search,ea,ear,ar,arc,rc,rch,ch}`
`cat`	`MINGRAM = 2`, `MAXGRAM = 3`, `STARTMARKER = '^'`, `ENDMARKER = '$'`	`{^ca,^cat,cat$,at$}`
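The first two rows of the table can be reproduced with a short Python model of the sliding window. This is a sketch for building intuition, not the template's actual code: the function name `ngrams` and its parameters are illustrative, and the position of the preserved token in the output simply mirrors the ordering shown in the table.

```python
def ngrams(token, mingram=2, maxgram=3, preserve_original=False):
    """Model of the gram stream: ordered by start position, then by length."""
    out = []
    for start in range(len(token)):
        for size in range(mingram, maxgram + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
        # The table lists the whole token right after the grams that begin
        # at the first character; that placement is mirrored here.
        # Tokens short enough to also be a gram are not modeled.
        if preserve_original and start == 0:
            out.append(token)
    return out


print(ngrams("search"))
# ['se', 'sea', 'ea', 'ear', 'ar', 'arc', 'rc', 'rch', 'ch']
print(ngrams("search", preserve_original=True))
# ['se', 'sea', 'search', 'ea', 'ear', 'ar', 'arc', 'rc', 'rch', 'ch']
```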

Preview the gram stream with ts_lexize:

Query

CREATE TEXT SEARCH DICTIONARY tok_ngram (
    template = 'ngram',
    mingram = 2,
    maxgram = 3
);
SELECT ts_lexize('tok_ngram', 'search');

Result

            ts_lexize
----------------------------------
 {se,sea,ea,ear,ar,arc,rc,rch,ch}

PRESERVEORIGINAL = true keeps the whole word in the stream alongside its grams, so an exact match still scores:

Query

CREATE TEXT SEARCH DICTIONARY tok_ngram_orig (
    template = 'ngram',
    mingram = 2,
    maxgram = 3,
    preserveoriginal = true
);
SELECT ts_lexize('tok_ngram_orig', 'search');

Result

                ts_lexize
-----------------------------------------
 {se,sea,search,ea,ear,ar,arc,rc,rch,ch}

STARTMARKER and ENDMARKER tag only the boundary grams — those at the start of the word carry the start marker and those at the end carry the end marker — so a prefix or suffix query can be distinguished from an interior match:

Query

CREATE TEXT SEARCH DICTIONARY tok_ngram_mark (
    template = 'ngram',
    mingram = 2,
    maxgram = 3,
    startmarker = '^',
    endmarker = '$'
);
SELECT ts_lexize('tok_ngram_mark', 'cat');

Result

      ts_lexize
---------------------
 {^ca,^cat,cat$,at$}
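The marker rule can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the template's real logic: `marked_ngrams` is a hypothetical name, a gram that spans the whole token is emitted once per marker (as the `cat` result suggests), and interior grams are assumed to pass through unmarked, which this page does not show directly.

```python
def marked_ngrams(token, mingram=2, maxgram=3, startmarker="", endmarker=""):
    """Model of boundary tagging on the sliding-window gram stream."""
    out = []
    length = len(token)
    for start in range(length):
        for size in range(mingram, maxgram + 1):
            end = start + size
            if end > length:
                continue
            gram = token[start:end]
            at_start = start == 0 and bool(startmarker)
            at_end = end == length and bool(endmarker)
            if at_start:
                out.append(startmarker + gram)
            if at_end:
                out.append(gram + endmarker)
            if not at_start and not at_end:
                # Assumption: grams touching neither boundary stay unmarked.
                out.append(gram)
    return out


print(marked_ngrams("cat", startmarker="^", endmarker="$"))
# ['^ca', '^cat', 'cat$', 'at$']
```

Because `cat` is only three characters long, every gram touches a boundary, which is why no unmarked gram appears in that result.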

Examples

Query

CREATE TEXT SEARCH DICTIONARY ngram_dict (
    template = 'ngram',
    mingram = 2,
    maxgram = 3
);
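To see why a dictionary configured with these defaults tolerates typos, the Python sketch below compares the grams of an indexed word with those of a misspelled query. It only illustrates gram overlap: the helper name `grams` is made up for this example, and how the search engine actually ranks or thresholds such matches is not covered on this page.

```python
def grams(token, mingram=2, maxgram=3):
    """Set of all character windows with length in [mingram, maxgram]."""
    return {
        token[i:i + n]
        for n in range(mingram, maxgram + 1)
        for i in range(len(token) - n + 1)
    }


indexed = grams("search")
query = grams("serch")  # misspelled: the "a" is missing

shared = indexed & query
print(sorted(shared))
# ['ch', 'rc', 'rch', 'se']
print(f"{len(shared)} of {len(query)} query grams match")
# 4 of 7 query grams match
```

Even with a character missing, more than half of the query's grams still occur in the indexed word, so the two remain connected through the index.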

Unigrams and bigrams

Query

CREATE TEXT SEARCH DICTIONARY unigram_dict (
    template = 'ngram',
    mingram = 1,
    maxgram = 2
);
