Skip to main content

Sparse Ngram

The sparse_ngram template indexes text for substring search — finding an arbitrary fragment anywhere inside a value, the way LIKE '%fragment%' does, but accelerated by an inverted index instead of a full scan.

It targets text where splitting on word boundaries does not help: source code, log lines, URLs, file paths, identifiers, serial numbers. For that kind of data a plain word tokenizer cannot answer "which rows contain i=42", and a fixed-width ngram dictionary that could would bloat the index with every overlapping window. sparse_ngram keeps the index compact by emitting only a small, carefully chosen set of variable-length grams, while still letting any substring query be answered exactly.

How you use it

Substring search needs two dictionaries built from this template — they differ only in the COVERING option:

  • an indexing dictionary (COVERING = false) attached to the column, and
  • a querying dictionary (COVERING = true) used to tokenize the search string.

A query then matches every row whose indexed grams contain all of the query's grams (ts_all). That conjunction returns a small candidate set, which a LIKE predicate filters down to exact matches. The index does the expensive narrowing; LIKE only runs on the few rows that survive.

Options

OptionTypeDefaultDescription
COVERINGbooleanfalseWhich side the dictionary serves. Leave false for the dictionary attached to the indexed column. Set true for the dictionary used to tokenize query strings, so their grams can be AND-ed together to find containing rows.
MAXNGRAMLENGTHinteger16Largest gram length, in bytes. Longer grams are more selective, so queries return fewer false candidates to verify, at the cost of a larger index. Minimum 3.

Inputs shorter than 3 bytes produce no grams. The template supports the FREQUENCY and NORM index features, so bm25() can rank rows by how much of the query they contain; POSITION and OFFSET are not supported.

Tokenization

Unlike ngram, which emits every fixed-width window, sparse_ngram picks a small set of variable-length grams that still let any substring query be answered. The COVERING option controls which set: the indexing side (false) stores enough grams to cover the value, while the querying side (true) emits only the few grams a search string must share with a row.

InputOptionsTokens
hello worldindexing (COVERING = false){hel,ell,llo,"lo ","o w","lo w"," wo","lo wo",wor,orl,worl,rld}
hello worldquerying (COVERING = true){hel,ell,llo,rld,worl,"lo wo"}
hello worldMAXNGRAMLENGTH = 4{hel,ell,llo,"lo ","o w","lo w"," wo",wor,orl,worl,rld}

The querying set is a strict subset of the indexing set, so requiring all of a query's grams (ts_all) selects exactly the rows whose indexed grams contain them. The verified token streams appear in the examples below.

Examples

Create the indexing dictionary and inspect the grams it stores per value with ts_lexize. It enables FREQUENCY and NORM so bm25() can rank later:

Query
CREATE TEXT SEARCH DICTIONARY code_grams (    template = 'sparse_ngram',    frequency = true,    norm = true);
SELECT ts_lexize('code_grams', 'hello world');
Result
 ts_lexize----------------------------------------------------------------- {hel,ell,llo,"lo ","o w","lo w"," wo","lo wo",wor,orl,worl,rld}

Create the querying dictionary with COVERING = true. For the same string it emits a much smaller set — the grams to require together when searching:

Query
CREATE TEXT SEARCH DICTIONARY code_grams_q (    template = 'sparse_ngram',    covering = true);
SELECT ts_lexize('code_grams_q', 'hello world');
Result
 ts_lexize-------------------------------- {hel,ell,llo,rld,worl,"lo wo"}

Attach the indexing dictionary to a column, then search for a substring by requiring all of its query grams and confirming exactness with LIKE. ts_highlight(code) wraps the matched region so you can see what was found:

Query
CREATE TABLE snippets (id INT, code TEXT);
INSERT INTO snippets VALUES    (1, 'for(int i=42; i<n; ++i) total += data[i];'),    (2, 'for(int j=0; j<n; ++j) total -= data[j];'),    (3, 'while(n--) { sum += *ptr++; }'),    (4, 'hello world');
CREATE INDEX snippets_idx ON snippets USING inverted (id, code code_grams);
SELECT id, ts_highlight(code) AS matchFROM snippets_idxWHERE code @@ ts_all(ts_lexize('code_grams_q', 'for(int i=42'))  AND code LIKE '%for(int i=42%'ORDER BY id;
Result
 id | match----+------------------------------------------------  1 | <b>for(int i=42</b>; i<n; ++i) total += data[i

Requiring only some of the query grams with ts_any(..., k) instead of all of them makes the search approximate. Rows that share more grams with the query rank higher under bm25(), so the closest matches surface first, and ts_highlight(code) shows how much of each row matched:

Query
CREATE TABLE ranked (id INT, code TEXT);
INSERT INTO ranked VALUES    (1, 'while l < r: m = (l + r) // 2'),    (2, 'some noise prefix; m = (l + r) // 2'),    (3, 'much longer noise that matches nothing at all; r) // 2'),    (4, 'completely unrelated text');
CREATE INDEX ranked_idx ON ranked USING inverted (id, code code_grams);
SELECT id, ts_highlight(code) AS matchFROM ranked_idxWHERE code @@ ts_any(ts_lexize('code_grams_q', 'while l < r: m = (l + r) // 2'), 1)ORDER BY bm25(ranked_idx.tableoid) DESC;
Result
 id | match----+---------------------------------------------------------------  1 | <b>while l < r: m = (l + r) // 2</b>  2 | some noise prefix; <b>m = (l + r) // 2</b>  3 | much longer noise that matches nothing at all; <b>r) // 2</b>

Deriving the querying dictionary

Rather than configuring the querying dictionary by hand, derive it from the indexing one with copy_from and flip COVERING. This keeps the two in sync as the indexing options evolve:

Query
CREATE TEXT SEARCH DICTIONARY code_grams_q2 (    template = 'copy_from',    from = 'code_grams',    covering = true);
SELECT ts_lexize('code_grams_q2', 'for(int i=42');
Result
 ts_lexize------------------------ {for(i,"(int i=4",=42}

Tuning gram length

Lowering MAXNGRAMLENGTH caps how long a gram can grow. Here the longer lo wo gram the default keeps is dropped, yielding a smaller index at the cost of less selective queries:

Query
CREATE TEXT SEARCH DICTIONARY code_grams_short (    template = 'sparse_ngram',    maxngramlength = 4);
SELECT ts_lexize('code_grams_short', 'hello world');
Result
 ts_lexize--------------------------------------------------------- {hel,ell,llo,"lo ","o w","lo w"," wo",wor,orl,worl,rld}

See also