BM25/TFIDF Ranking
Relevance scoring ranks search results by how well they match a query. The two most common algorithms are BM25 and TF-IDF, both based on term frequency and inverse document frequency.
See Setup for the shared dataset used in all examples.
How ranking works
Both algorithms rely on two statistical measures:
- Term frequency (TF) — how often a search term appears in a given document. A document mentioning "galaxy" five times is considered more relevant than one mentioning it once.
- Inverse document frequency (IDF) — how rare the term is across all indexed documents. Common words like "the" appear everywhere and carry little signal. A rare term like "paleontologist" is a much stronger match indicator.
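The two measures above can be sketched in a few lines of Python. This is an illustration only: the toy corpus, the function names, and the plain `log(N / df)` form of IDF are assumptions made for the example, and the engine's internal formula may use a smoothed variant.

```python
import math

# Hypothetical toy corpus (not the Setup dataset): each document is a token list.
docs = [
    "the galaxy is vast and the galaxy is old".split(),
    "the paleontologist studied the fossil".split(),
    "the crew left the galaxy".split(),
]

def term_frequency(term, doc):
    """Raw count of `term` in a single document."""
    return doc.count(term)

def inverse_document_frequency(term, all_docs):
    """Classic log(N / df); returns 0.0 for terms that appear nowhere."""
    df = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / df) if df else 0.0

tf_galaxy = term_frequency("galaxy", docs[0])                  # 2
idf_the = inverse_document_frequency("the", docs)              # 0.0, appears in every document
idf_galaxy = inverse_document_frequency("galaxy", docs)        # appears in 2 of 3 documents
idf_paleo = inverse_document_frequency("paleontologist", docs) # appears in 1 of 3 documents

print(tf_galaxy, idf_the, idf_galaxy, idf_paleo)
```

In this sketch "the" gets an IDF of zero because it occurs in every document, while the rarest term gets the highest weight.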
BM25 vs TF-IDF
The key difference is that BM25 adds document length normalization and term frequency saturation on top of TF-IDF. In practice this means BM25 handles varying document lengths better — a short document with two mentions of a term can rank above a long document with three.
| | TF-IDF | BM25 |
|---|---|---|
| Length normalization | No | Yes (parameter b) |
| TF saturation | No — score grows linearly | Yes — diminishing returns (parameter k1) |
| Reads norms from index | No | Yes |
| Performance | Faster — fewer index reads | Slightly slower due to norm lookups |
| Best for | Uniform-length documents, latency-sensitive workloads | Mixed-length documents, general-purpose ranking |
Use BM25 as the default. Use TF-IDF when your documents are roughly the same length or when you need lower scoring latency — TF-IDF is faster because it does not need to read document length norms from the index.
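To make the comparison concrete, here is a minimal sketch of the per-term score under both schemes. It assumes the textbook Okapi BM25 formula and an unnormalized TF-IDF product; the function names and the sample lengths are hypothetical, and the engine's exact arithmetic may differ.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of one term to one document."""
    # Length factor: below 1 for short documents, above 1 for long ones.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

def tfidf_term_score(tf, idf):
    """Unnormalized TF-IDF: linear in tf, blind to document length."""
    return tf * idf

idf = 1.0       # assumed constant so only tf and length vary
avg_len = 100   # assumed average document length in tokens

# Short document with two mentions vs. long document with three mentions.
bm25_short = bm25_term_score(tf=2, doc_len=20, avg_doc_len=avg_len, idf=idf)
bm25_long = bm25_term_score(tf=3, doc_len=400, avg_doc_len=avg_len, idf=idf)
tfidf_short = tfidf_term_score(tf=2, idf=idf)
tfidf_long = tfidf_term_score(tf=3, idf=idf)

print(f"BM25:   short={bm25_short:.3f} long={bm25_long:.3f}")
print(f"TF-IDF: short={tfidf_short:.3f} long={tfidf_long:.3f}")
```

With these inputs BM25 ranks the short document first, while plain TF-IDF ranks the long one first because it only counts mentions.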
BM25 scoring
Use BM25() in the SELECT and ORDER BY clauses to rank results by relevance:
SELECT id, title, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25() DESC;
Custom parameters
Pass k1 and b to tune the ranking:
SELECT id, title, BM25(1.2, 0.75) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'galaxy')
ORDER BY BM25(1.2, 0.75) DESC;
| Parameter | Default | Description |
|---|---|---|
| k1 | 1.2 | Term frequency saturation. Higher values increase the impact of term frequency |
| b | 0.75 | Document length normalization. 0 disables normalization, 1 fully normalizes |
Lower k1 so that the presence of a term matters more than how often it repeats, and set b = 0 to disable length normalization:
SELECT id, title, BM25(0.5, 0) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(0.5, 0) DESC;
Increase k1 to reward documents that mention the term many times:
SELECT id, title, BM25(2.0, 0.75) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'film')
ORDER BY BM25(2.0, 0.75) DESC;
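The effect of k1 can be checked numerically. The sketch below assumes the textbook BM25 term-frequency component with b = 0, so document length plays no part; `tf_component` and `gain` are hypothetical helper names, not engine functions.

```python
def tf_component(tf, k1):
    """TF part of textbook BM25 with b = 0 (length normalization disabled)."""
    return tf * (k1 + 1.0) / (tf + k1)

def gain(k1, low_tf=1, high_tf=10):
    """Score ratio between a document with high_tf mentions and one with low_tf."""
    return tf_component(high_tf, k1) / tf_component(low_tf, k1)

gains = {k1: gain(k1) for k1 in (0.5, 1.2, 2.0)}
for k1, g in gains.items():
    print(f"k1={k1}: 10 mentions score {g:.2f}x one mention")
# k1=0.5: 10 mentions score 1.43x one mention
# k1=1.2: 10 mentions score 1.96x one mention
# k1=2.0: 10 mentions score 2.50x one mention
```

In this form the TF component can never exceed k1 + 1, which is why a larger k1 leaves more room for repeated mentions to raise the score.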
Named variants
Specific combinations of k1 and b produce well-known BM25 variants:
| Variant | Parameters | Behavior |
|---|---|---|
| BM25 | BM25(1.2, 0.75) | Default — balanced saturation and length normalization |
| BM15 | BM25(1.2, 0) | No length normalization (b=0). Treats all documents equally regardless of length |
| BM11 | BM25(1.2, 1) | Full length normalization (b=1). Strongly penalizes long documents |
| BM0 | BM25(0, 0) | Pure IDF. Term frequency is ignored entirely, so only the rarity of the term across documents matters |
-- BM15: ignore document length differences
SELECT id, title, BM25(1.2, 0) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(1.2, 0) DESC;
-- BM11: strongly favor shorter documents
SELECT id, title, BM25(1.2, 1) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25(1.2, 1) DESC;
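The behavior of each variant follows directly from the formula. The sketch below again assumes the textbook Okapi BM25 term score; the IDF value, average length, and helper names are hypothetical and chosen only to make the differences visible.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1, b):
    """Textbook Okapi BM25 contribution of one term to one document."""
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

IDF, AVG_LEN = 2.0, 100  # assumed values for illustration

def score(tf, doc_len, k1, b):
    return bm25_term_score(tf, doc_len, AVG_LEN, IDF, k1, b)

# BM0 (k1=0, b=0): neither frequency nor length changes the score.
bm0_few, bm0_many = score(1, 50, 0, 0), score(7, 400, 0, 0)

# BM15 (b=0): same tf, different lengths, identical score.
bm15_short, bm15_long = score(3, 50, 1.2, 0), score(3, 200, 1.2, 0)

# BM11 (b=1): same tf, the longer document is penalized.
bm11_short, bm11_long = score(3, 50, 1.2, 1), score(3, 200, 1.2, 1)

print("BM0 :", bm0_few, bm0_many)
print("BM15:", bm15_short, bm15_long)
print("BM11:", bm11_short, bm11_long)
```

Under these assumptions both BM0 scores equal the IDF value, the two BM15 scores are identical, and BM11 ranks the shorter document higher.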
Combine with filters
Add further predicates to the WHERE clause to score only the documents that pass a filter:
SELECT id, title, genre, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'film') AND TERM_EQ(genre, 'drama')
ORDER BY BM25() DESC;
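Combine with analytics
Scoring functions can also be used inside aggregates, for example to compare average relevance per genre: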
SELECT genre, COUNT(*) AS matches, AVG(BM25()) AS avg_relevance
FROM movies_idx
WHERE PHRASE(description, 'biggest blockbuster')
GROUP BY genre
ORDER BY avg_relevance DESC;
Pagination with stable ordering
When paginating, add a tiebreaker column to ensure consistent ordering across pages:
SELECT id, title, BM25() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY BM25() DESC, id
LIMIT 10 OFFSET 0;
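The need for a tiebreaker can be simulated outside the database. In the sketch below, the rows, the `page` helper, and the two arrival orders are hypothetical; they stand in for an engine that returns tied rows in whatever order it happens to read them.

```python
# (id, relevance) pairs; ids 2, 3 and 4 share the same score.
rows_a = [(1, 2.0), (2, 1.5), (3, 1.5), (4, 1.5), (5, 0.9)]
rows_b = list(reversed(rows_a))  # same rows, different arrival order

def page(rows, limit, offset, tiebreak):
    """Return the ids on one page, ordered by score DESC (and id ASC if tiebreak)."""
    if tiebreak:
        key = lambda r: (-r[1], r[0])
    else:
        key = lambda r: -r[1]
    ordered = sorted(rows, key=key)
    return [r[0] for r in ordered[offset:offset + limit]]

# Without a tiebreaker, page 1 depends on arrival order.
print(page(rows_a, 2, 0, tiebreak=False))  # [1, 2]
print(page(rows_b, 2, 0, tiebreak=False))  # [1, 4]

# With the id tiebreaker, both arrival orders give the same pages.
print(page(rows_a, 2, 0, tiebreak=True), page(rows_a, 2, 2, tiebreak=True))  # [1, 2] [3, 4]
print(page(rows_b, 2, 0, tiebreak=True), page(rows_b, 2, 2, tiebreak=True))  # [1, 2] [3, 4]
```

Without the tiebreaker, a row can show up on two pages or on none when tied rows are returned in a different order between requests.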
TFIDF scoring
Use TFIDF() as an alternative scoring function:
SELECT id, title, TFIDF() AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY TFIDF() DESC;
With normalization
Pass true to enable normalization:
SELECT id, title, TFIDF(true) AS relevance
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY TFIDF(true) DESC;
Custom scoring
Combine relevance scores with other columns for domain-specific ranking:
SELECT id, title, BM25() AS relevance, runtime,
BM25() * LOG(runtime + 1) AS custom_score
FROM movies_idx
WHERE PHRASE(description, 'alien')
ORDER BY custom_score DESC;
Dictionary requirements
To use scoring functions, your dictionary must have FREQUENCY = true:
CREATE TEXT SEARCH DICTIONARY ranking_dict (
TEMPLATE = 'text',
LOCALE = 'en_US.UTF-8',
CASE = 'lower',
STEMMING = true,
ACCENT = false,
FREQUENCY = true,
POSITION = true
);
The FREQUENCY flag stores term frequency data in the index, which BM25 and TF-IDF need for scoring.
See also
- Phrase and Proximity Search — finding phrase matches to rank
- CREATE TEXT SEARCH DICTIONARY — frequency and position flags