Skip to main content

Indexing Views

An inverted index can be built over a view, not just a base table. This is how SereneDB searches data it does not own a primary copy of — most importantly external files (Parquet, CSV, JSON on disk or S3) exposed through a view.

A view has no primary key of its own, so two things have to be resolved: a row identity (so each indexed document can be located again) and, when a query needs a column the index does not hold, a way to fetch that column on demand. SereneDB handles both through a fast-path source.

Counts, relevance scores and any indexed or INCLUDEd column come straight from the frozen index; any other column is fetched live from the source by row identity at query time. That split — frozen postings, live values — is the whole model.

All examples below use this setup — a base table and a view over it:

Query
CREATE TABLE base_docs (id INTEGER PRIMARY KEY, body VARCHAR, extra VARCHAR);
CREATE TEXT SEARCH DICTIONARY view_en (    template = 'text',    locale = 'en_US.UTF-8',    case = 'lower',    stemming = false,    accent = false,    frequency = true,    position = true);
INSERT INTO base_docs VALUES    (1, 'quick brown fox', 'alpha'),    (2, 'lazy dog sleeps', 'beta'),    (3, 'quick red fox', 'gamma');
CREATE VIEW v_docs AS SELECT id, body, extra FROM base_docs;
CREATE INDEX v_docs_idx ON v_docs USING inverted (id, body view_en);

Row identity

How the index derives a stable per-row identity depends on the view body:

View bodyRow identityExplicit pk?
SELECT … FROM base_table (has PRIMARY KEY)The base table's primary keyAuto
SELECT … FROM base_table (no PK)The hidden row idAuto
SELECT … FROM read_parquet('file')The reader's file_row_numberAuto
read_csv / read_json single fileThe byte offset within the fileAuto
Reader over a glob ('…/*.parquet')(file_index, position) pairAuto
Attached DuckDB / Iceberg tableThe source row id / (file_index, row)Auto
Generic body (inline VALUES, UNION ALL, joins)None detectableRequiredWITH (pk = '…')

Fast-path sources

A fast-path source is a view body SereneDB recognizes well enough to both derive the row identity and re-read columns by it later. The recognized sources are: SereneDB base tables, read_parquet / read_csv / read_json (and their *_auto / parquet_scan / read_ndjson variants), Iceberg tables, attached DuckDB tables and read_text / read_blob.

The view body may shape the source freely and still qualify as fast-path:

  • a column subset, reordering or renaming (SELECT body, id FROM …, SELECT a AS x FROM …);
  • a cast on the indexed column (SELECT id::BIGINT, body FROM …);
  • an indexed expression over source columns (upper(body), (json ->> 'b'));
  • a WHERE / ORDER BY / LIMIT in the view body (the index then captures only the rows the view emits).

A body with no fast-path leaf — inline VALUES, a UNION ALL, a join — is a generic view: it must be given WITH (pk = '…') and supports only the non-materializing queries below.

What runs without materialization

These queries are answered entirely from the index, never touching the source — so they work on every view shape, including generic views:

  • COUNT(*) / COUNT(1), with or without a @@ filter;
  • full-text @@ filters and secondary scalar filters (=, <, BETWEEN, IN) that get pushed into the index scan;
  • relevance scores — BM25, TFIDF — and the index tableoid;
  • ts_offsets;
  • projections of indexed columns and of INCLUDEd columns and aggregates over them.
Query
SELECT count(*) FROM v_docs_idx WHERE body @@ ts_phrase('quick');
Result
 count-------     2
Query
SELECT id FROM v_docs_idxWHERE body @@ ts_phrase('quick')ORDER BY BM25(v_docs_idx.tableoid) DESC, id;
Result
 id----  1  3

Materializing real columns

Selecting a real source column that is neither indexed nor INCLUDEd materializes it: the index hands back the row identities for the matches, and SereneDB re-reads those columns from the source through the fast-path lookup. The column values are not stored in the index — they are fetched live at query time:

Query
SELECT id, extra FROM v_docs_idx WHERE body @@ ts_phrase('quick') ORDER BY id;
Result
 id | extra----+-------  1 | alpha  3 | gamma

Because the read is live, if the underlying source changed after the index was built, materialized values reflect the current source: rows deleted from the source come back as NULL, and edited content shows its new value (or raises an error if a file became unreadable). Counts and scores, by contrast, still reflect the frozen build-time snapshot.

Generic views

A generic view (no fast-path source) still indexes and answers the non-materializing queries, but it must be given an explicit primary key:

Query
SELECT count(*) FROM u_docs_idx WHERE body @@ ts_phrase('quick');
Result
 count-------     1

Selecting its real columns is not supported and raises an error:

Query
SELECT id, body FROM u_docs_idx WHERE body @@ ts_phrase('quick');
Result
db error: ERROR: materialising real columns from this view-backed inverted index is not yet supported -- view body must be a simple `SELECT * FROM <reader>(literal_args)` over a recognised fast-path source (read_parquet/csv/json/...)

INCLUDE columns

INCLUDEd columns on a view are stored in the index's columnstore, so they are returned without materializing the source — the same as on a base table. Use INCLUDE for columns you frequently return but never search, to avoid the per-row source lookup.

Snapshot and isolation

A view-backed index is a static snapshot: its postings are captured at CREATE INDEX time and are not refreshed when the source changes (there is no background DML tracking as there is for base tables). To pick up new data, rebuild the index. A reader transaction keeps a consistent view of the index even if the underlying view is dropped concurrently.

See also