BM25/TFIDF Ranking

Relevance scoring ranks search results by how well they match a query. The two most common algorithms are BM25 and TF-IDF, both based on term frequency and inverse document frequency.

See Setup for the shared dataset used in all examples.

How ranking works

Both algorithms rely on two statistical measures:

  • Term frequency (TF) — how often a search term appears in a given document. A document mentioning "galaxy" five times is considered more relevant than one mentioning it once.
  • Inverse document frequency (IDF) — how rare the term is across all indexed documents. Common words like "the" appear everywhere and carry little signal. A rare term like "paleontologist" is a much stronger match indicator.
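
The two measures above can be sketched in a few lines of Python. This is only an illustration on a made-up three-document corpus. The smoothed IDF formula used here is the common textbook BM25 form and is an assumption, since this page does not state which formula the engine applies.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
docs = {
    1: "the galaxy is vast and the galaxy is old".split(),
    2: "the paleontologist studied the fossil".split(),
    3: "the ship left the galaxy".split(),
}

def term_frequency(term, tokens):
    """Number of times `term` occurs in a single document."""
    return tokens.count(term)

def inverse_document_frequency(term, corpus):
    """Smoothed IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).

    N is the number of documents, n the number that contain `term`.
    Assumed textbook form, not necessarily what the engine uses.
    """
    n_docs = len(corpus)
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))

tf_galaxy = term_frequency("galaxy", docs[1])
idf_the = inverse_document_frequency("the", docs)
idf_paleo = inverse_document_frequency("paleontologist", docs)

print(f"tf('galaxy', doc 1) = {tf_galaxy}")
print(f"idf('the') = {idf_the:.3f}, idf('paleontologist') = {idf_paleo:.3f}")
```

Because "the" occurs in every document, its IDF is close to zero, while "paleontologist" occurs in only one document and receives a much larger weight.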

BM25 vs TF-IDF

The key difference is that BM25 adds document length normalization and term frequency saturation on top of TF-IDF. In practice this means BM25 handles varying document lengths better — a short document with two mentions of a term can rank above a long document with three.
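
To make the short-versus-long example concrete, here is a minimal sketch of the textbook BM25 per-term formula next to plain TF-IDF. The function names, document lengths, and the average length of 100 tokens are hypothetical, and the engine's internal formula may differ in detail.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score."""
    # Documents longer than average get length_norm > 1, which lowers the score.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

def tfidf_term_score(tf, idf=1.0):
    """Plain TF-IDF: grows linearly with tf and ignores document length."""
    return tf * idf

# Hypothetical documents in an index whose average length is 100 tokens.
short_doc = {"tf": 2, "doc_len": 20}
long_doc = {"tf": 3, "doc_len": 200}

bm25_short = bm25_term_score(short_doc["tf"], short_doc["doc_len"], avg_doc_len=100)
bm25_long = bm25_term_score(long_doc["tf"], long_doc["doc_len"], avg_doc_len=100)
tfidf_short = tfidf_term_score(short_doc["tf"])
tfidf_long = tfidf_term_score(long_doc["tf"])

print(f"BM25:   short={bm25_short:.3f}  long={bm25_long:.3f}")
print(f"TF-IDF: short={tfidf_short:.3f}  long={tfidf_long:.3f}")
```

Under these assumptions BM25 ranks the short document first, while plain TF-IDF ranks the long document first because it only counts mentions.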

| Feature | TF-IDF | BM25 |
|---|---|---|
| Length normalization | No | Yes (parameter b) |
| TF saturation | No, score grows linearly | Yes, diminishing returns (parameter k1) |
| Reads norms from index | No | Yes |
| Performance | Faster, fewer index reads | Slightly slower due to norm lookups |
| Best for | Uniform-length documents, latency-sensitive workloads | Mixed-length documents, general-purpose ranking |

Use BM25 as the default. Use TF-IDF when your documents are roughly the same length or when you need lower scoring latency — TF-IDF is faster because it does not need to read document length norms from the index.

BM25 scoring

Use BM25() in the SELECT and ORDER BY clauses to rank results by relevance:

SELECT id, title, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25() DESC;

Custom parameters

Pass k1 and b, in that order, to tune the ranking:

SELECT id, title, BM25(1.2, 0.75) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'galaxy')
ORDER BY BM25(1.2, 0.75) DESC;

| Parameter | Default | Description |
|---|---|---|
| k1 | 1.2 | Term frequency saturation. Higher values increase the impact of term frequency |
| b | 0.75 | Document length normalization. 0 disables normalization, 1 fully normalizes |

Lower k1 so that the presence of a term counts for more than how often it repeats, and disable length normalization with b = 0:

SELECT id, title, BM25(0.5, 0) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(0.5, 0) DESC;

Increase k1 to reward documents that mention the term many times:

SELECT id, title, BM25(2.0, 0.75) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'film')
ORDER BY BM25(2.0, 0.75) DESC;
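
The effect of k1 can be sketched with the term-frequency factor from the textbook BM25 formula, evaluated for a document of exactly average length so that b drops out. The helper names below are hypothetical and the formula is assumed rather than taken from the engine's source.

```python
def tf_factor(tf, k1):
    """BM25 term-frequency factor for a document of average length.

    It equals 1 at tf = 1 and approaches k1 + 1 as tf grows (saturation).
    """
    return tf * (k1 + 1) / (tf + k1)

def repetition_gain(k1, low_tf=1, high_tf=10):
    """How much more a document with high_tf mentions scores than one with low_tf."""
    return tf_factor(high_tf, k1) / tf_factor(low_tf, k1)

for k1 in (0.5, 1.2, 2.0):
    print(f"k1={k1}: 10 mentions score {repetition_gain(k1):.2f}x a single mention")
```

With a small k1 the factor saturates almost immediately, so ten mentions score only slightly higher than one. With a larger k1 the ceiling of k1 + 1 is higher and repeated mentions keep adding to the score for longer.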

Named variants

Specific combinations of k1 and b produce well-known BM25 variants:

| Variant | Parameters | Behavior |
|---|---|---|
| BM25 | BM25(1.2, 0.75) | Default: balanced saturation and length normalization |
| BM15 | BM25(1.2, 0) | No length normalization (b=0). Treats all documents equally regardless of length |
| BM11 | BM25(1.2, 1) | Full length normalization (b=1). Strongly penalizes long documents |
| BM0 | BM25(0, 0) | Pure IDF: term frequency is ignored entirely. Only term rarity matters |

-- BM15: ignore document length differences
SELECT id, title, BM25(1.2, 0) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(1.2, 0) DESC;

-- BM11: strongly favor shorter documents
SELECT id, title, BM25(1.2, 1) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(1.2, 1) DESC;
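
The behavior of the four variants can be checked against the textbook BM25 formula. The sketch below is illustrative only: the IDF value, term frequency, and document lengths are made up, and the formula is an assumption about how the parameters are applied.

```python
# (k1, b) presets from the table of named variants.
VARIANTS = {
    "BM25": (1.2, 0.75),
    "BM15": (1.2, 0.0),
    "BM11": (1.2, 1.0),
    "BM0": (0.0, 0.0),
}

def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1, b):
    """Textbook BM25 contribution of one term to one document's score."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

def score_all(tf, doc_len, avg_doc_len=100, idf=1.5):
    """Score one hypothetical document under every named variant."""
    return {
        name: bm25_term_score(tf, doc_len, avg_doc_len, idf, k1, b)
        for name, (k1, b) in VARIANTS.items()
    }

short_doc = score_all(tf=3, doc_len=50)
long_doc = score_all(tf=3, doc_len=400)

for name in VARIANTS:
    print(f"{name}: short={short_doc[name]:.3f}  long={long_doc[name]:.3f}")
```

In this sketch BM15 gives both documents the same score, BM11 separates them more than the default does, and BM0 returns the bare IDF value for any document that contains the term.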

Combine with filters

SELECT id, title, genre, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'film') AND TERM_EQ(genre, 'drama')
ORDER BY BM25() DESC;

Combine with analytics

SELECT genre, COUNT(*) AS matches, AVG(BM25()) AS avg_relevance
FROM movies_idx
WHERE PHRASE(description, 'biggest blockbuster')
GROUP BY genre
ORDER BY avg_relevance DESC;

Pagination with stable ordering

When paginating, add a tiebreaker column to ensure consistent ordering across pages:

SELECT id, title, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25() DESC, id
LIMIT 10 OFFSET 0;
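
The reason for the tiebreaker can be shown outside the database. In the sketch below, several hypothetical rows share the same relevance score. Sorting on the pair (relevance descending, id ascending) yields the same pages no matter what order the rows arrive in, which mirrors what ORDER BY BM25() DESC, id does.

```python
# Hypothetical result rows; three of them tie on relevance.
rows = [
    {"id": 7, "relevance": 1.5},
    {"id": 3, "relevance": 2.0},
    {"id": 9, "relevance": 1.5},
    {"id": 1, "relevance": 1.5},
]

def page(result_rows, limit, offset):
    """Return the ids on one page, ordered by relevance DESC, then id ASC."""
    ordered = sorted(result_rows, key=lambda r: (-r["relevance"], r["id"]))
    return [r["id"] for r in ordered[offset:offset + limit]]

first_page = page(rows, limit=2, offset=0)
second_page = page(rows, limit=2, offset=2)
print(first_page, second_page)
```

Without the id in the sort key, the relative order of the tied rows would depend on the order in which they were produced, so a row could appear on two pages or on none.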

TFIDF scoring

Use TFIDF() as an alternative scoring function:

SELECT id, title, TFIDF() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY TFIDF() DESC;

With normalization

Pass true to enable normalization:

SELECT id, title, TFIDF(true) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY TFIDF(true) DESC;

Custom scoring

Combine relevance scores with other columns for domain-specific ranking:

SELECT id, title, BM25() AS relevance, runtime,
       BM25() * LOG(runtime + 1) AS custom_score
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY custom_score DESC;
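
To see how such a blended score can reorder results, the expression can be reproduced on made-up values. The sketch assumes LOG is the natural logarithm, which this page does not confirm. The relevance scores and runtimes are hypothetical.

```python
import math

# Hypothetical rows: relevance stands in for the value BM25() would return.
movies = [
    {"id": 1, "relevance": 2.0, "runtime": 90},
    {"id": 2, "relevance": 1.8, "runtime": 180},
]

def custom_score(row):
    """relevance * ln(runtime + 1), assuming LOG means natural log."""
    return row["relevance"] * math.log(row["runtime"] + 1)

by_relevance = [m["id"] for m in sorted(movies, key=lambda m: m["relevance"], reverse=True)]
by_custom = [m["id"] for m in sorted(movies, key=custom_score, reverse=True)]

print("ordered by relevance:   ", by_relevance)
print("ordered by custom score:", by_custom)
```

The logarithm dampens the runtime so that it nudges the ranking without overwhelming the relevance score. Here the longer film overtakes a slightly more relevant but shorter one.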

Dictionary requirements

To use scoring functions, your dictionary must have FREQUENCY = true:

CREATE TEXT SEARCH DICTIONARY ranking_dict (
    TEMPLATE = 'text',
    LOCALE = 'en_US.UTF-8',
    CASE = 'lower',
    STEMMING = true,
    ACCENT = false,
    FREQUENCY = true,
    POSITION = true
);

The FREQUENCY flag stores term frequency data in the index, which BM25 and TF-IDF need for scoring.
