Skip to main content

Maintenance & Introspection

An inverted index is eventually consistent. Writes to the base table are buffered and become searchable only after the index is refreshed; segments are compacted in the background to keep queries fast; and per-term statistics used by scorers can be recomputed. This page covers that lifecycle and how to inspect an index.

Visibility and the refresh model

A newly written row is not searchable until a refresh publishes it to readers. There are two ways a refresh happens:

  • Automatically — a background thread refreshes each index on an interval (refresh_interval, default 1000 ms).
  • Explicitly — run VACUUM (REFRESH_TABLE) to publish pending writes immediately. This is what you want right after a bulk load, before querying:
Query
VACUUM (REFRESH_TABLE) my_logs;
SELECT id FROM logs_idx WHERE msg @@ 'disk' ORDER BY id;
Result
 id----  1

Background intervals

Three index WITH options control the background lifecycle (set at CREATE INDEX); 0 disables each:

OptionDefaultControls
refresh_interval1000Milliseconds between automatic refreshes
compaction_interval1000Milliseconds between automatic segment compactions
cleanup_interval_step1Commit ticks between cleanup passes

Manual maintenance with VACUUM

Each operation comes in a family scoped to a single index, a table, a schema, a database, or instance-wide (*_ALL, which takes no argument). At most one maintenance option may appear per VACUUM statement.

OperationFormsEffect
RefreshREFRESH_INDEX · REFRESH_TABLE · REFRESH_SCHEMA · REFRESH_DATABASE · REFRESH_ALLPublish pending writes to readers
CompactCOMPACT_INDEX · COMPACT_TABLE · COMPACT_SCHEMA · COMPACT_DATABASE · COMPACT_ALLMerge segments to reclaim space and speed queries
Recompute statisticsRECOMPUTE_STATS_COLUMN · RECOMPUTE_STATS_TABLE · RECOMPUTE_STATS_SCHEMA · RECOMPUTE_STATS_DATABASE · RECOMPUTE_STATS_ALLRebuild the term statistics used by relevance scoring and planning
Query
VACUUM (COMPACT_TABLE) my_logs;
Query
VACUUM (RECOMPUTE_STATS_TABLE) my_logs;

Recompute statistics after large changes in data distribution so that relevance scores and planning stay accurate.

Rebuilding

A view- or external-data-backed index is a static snapshot taken at CREATE INDEX time — it does not track later changes to its source. To pick up new data, rebuild it with DROP INDEX followed by CREATE INDEX.

Schema changes on an indexed table

An inverted index pins every column it reads, including columns reached only through an indexed expression — for example a struct sub-field such as (s['id']::INTEGER) pins the whole s column.

  • RENAME column, table or index — allowed; the index keys columns by id and follows the rename automatically.
  • ADD COLUMN — allowed; it is index-neutral (existing rows are backfilled).
  • ALTER COLUMN … TYPE on a pinned column — rejected; the index stores values of the old type. Drop the index first, change the type, then recreate it.
  • ADD / DROP / RENAME of a struct field on a pinned column — rejected for the same reason, even when the indexed expression targets a different sub-field (the whole column is pinned). Drop the index first.
  • DROP COLUMN on a pinned column — allowed; it cascade-drops every index that covers the column.

Performance

Tuning is mostly about the background cadence and segment layout; use only the options that exist:

  • Refresh vs. compaction cadence — lower refresh_interval for fresher results, raise it (or 0) to reduce overhead on write-heavy tables; compaction_interval / cleanup_interval_step govern how aggressively segments merge.
  • Row-group sizesrow_group_size (stored INCLUDEd columns) and norm_row_group_size (norm columns) control the columnstore batch size.
  • Build then index — for a bulk load, create the table, load the data, then create the index; this produces a more compact index than loading into an already-indexed table.
  • Top-K — set optimize_top_k to accelerate ORDER BY <scorer> … LIMIT k.

Session settings

Beyond the per-index WITH options, a few sdb_-prefixed session settings tune search at query time. Set them per connection with SET (and restore with RESET); they affect only the current session:

SettingDefaultEffect
sdb_disable_top_k_optimizationfalseWhen true, the optimizer does not pull ORDER BY <scorer> DESC LIMIT k into the index scan, so WAND top-K pruning never engages. Useful to A/B the optimization or work around a plan regression.
sdb_scored_terms_limit1024Maximum number of terms considered for scoring in multi-term filters. Higher values give more accurate IDF-style scoring at the cost of memory and per-query work; 0 disables scored-term collection entirely.
sdb_ef_search0Per-session override for the HNSW search-time neighbourhood size (efSearch) used by vector search. Higher = better recall, slower queries; 0 uses the query's LIMIT (Top-K) value instead.
SET sdb_ef_search = 64;          -- widen the HNSW search beam for this session
SET sdb_scored_terms_limit = 4096;
-- … run queries …
RESET sdb_ef_search; -- back to the default

Introspection

List the inverted indexes with the duckdb_indexes() table function:

Query
SELECT index_name, table_nameFROM duckdb_indexes()WHERE table_name = 'my_logs';
Result
 index_name | table_name------------+------------ logs_idx   | my_logs

The standard pg_indexes view also lists them (alongside the primary-key index), with the CREATE INDEX statement in its indexdef column:

Query
SELECT indexnameFROM pg_indexesWHERE tablename = 'my_logs'ORDER BY indexname;
Result
 indexname-------------- logs_idx my_logs_pkey

See also