Indexing Views
An inverted index can be built over a view, not just a base table. This is how SereneDB searches data it does not own a primary copy of — most importantly external files (Parquet, CSV, JSON on disk or S3) exposed through a view.
A view has no primary key of its own, so two things have to be resolved: a row identity (so each indexed document can be located again) and, when a query needs a column the index does not hold, a way to fetch that column on demand. SereneDB handles both through a fast-path source.
Counts, relevance scores and any indexed or INCLUDEd column come straight from the frozen index; any other column is fetched live from the source by row identity at query time. That split — frozen postings, live values — is the whole model.
All examples below use this setup — a base table and a view over it:
CREATE TABLE base_docs (id INTEGER PRIMARY KEY, body VARCHAR, extra VARCHAR);
CREATE TEXT SEARCH DICTIONARY view_en ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = false, accent = false, frequency = true, position = true);
INSERT INTO base_docs VALUES (1, 'quick brown fox', 'alpha'), (2, 'lazy dog sleeps', 'beta'), (3, 'quick red fox', 'gamma');
CREATE VIEW v_docs AS SELECT id, body, extra FROM base_docs;
CREATE INDEX v_docs_idx ON v_docs USING inverted (id, body view_en);Row identity
How the index derives a stable per-row identity depends on the view body:
| View body | Row identity | Explicit pk? |
|---|---|---|
SELECT … FROM base_table (has PRIMARY KEY) | The base table's primary key | Auto |
SELECT … FROM base_table (no PK) | The hidden row id | Auto |
SELECT … FROM read_parquet('file') | The reader's file_row_number | Auto |
read_csv / read_json single file | The byte offset within the file | Auto |
Reader over a glob ('…/*.parquet') | (file_index, position) pair | Auto |
| Attached DuckDB / Iceberg table | The source row id / (file_index, row) | Auto |
Generic body (inline VALUES, UNION ALL, joins) | None detectable | Required — WITH (pk = '…') |
Fast-path sources
A fast-path source is a view body SereneDB recognizes well enough to both derive the row identity and re-read columns by it later. The recognized sources are: SereneDB base tables, read_parquet / read_csv / read_json (and their *_auto / parquet_scan / read_ndjson variants), Iceberg tables, attached DuckDB tables and read_text / read_blob.
The view body may shape the source freely and still qualify as fast-path:
- a column subset, reordering or renaming (
SELECT body, id FROM …,SELECT a AS x FROM …); - a cast on the indexed column (
SELECT id::BIGINT, body FROM …); - an indexed expression over source columns (
upper(body),(json ->> 'b')); - a
WHERE/ORDER BY/LIMITin the view body (the index then captures only the rows the view emits).
A body with no fast-path leaf — inline VALUES, a UNION ALL, a join — is a generic view: it must be given WITH (pk = '…') and supports only the non-materializing queries below.
What runs without materialization
These queries are answered entirely from the index, never touching the source — so they work on every view shape, including generic views:
COUNT(*)/COUNT(1), with or without a@@filter;- full-text
@@filters and secondary scalar filters (=,<,BETWEEN,IN) that get pushed into the index scan; - relevance scores —
BM25,TFIDF— and the indextableoid; ts_offsets;- projections of indexed columns and of
INCLUDEd columns and aggregates over them.
SELECT count(*) FROM v_docs_idx WHERE body @@ ts_phrase('quick'); count------- 2SELECT id FROM v_docs_idxWHERE body @@ ts_phrase('quick')ORDER BY BM25(v_docs_idx.tableoid) DESC, id; id---- 1 3Materializing real columns
Selecting a real source column that is neither indexed nor INCLUDEd materializes it: the index hands back the row identities for the matches, and SereneDB re-reads those columns from the source through the fast-path lookup. The column values are not stored in the index — they are fetched live at query time:
SELECT id, extra FROM v_docs_idx WHERE body @@ ts_phrase('quick') ORDER BY id; id | extra----+------- 1 | alpha 3 | gammaBecause the read is live, if the underlying source changed after the index was built, materialized values reflect the current source: rows deleted from the source come back as NULL, and edited content shows its new value (or raises an error if a file became unreadable). Counts and scores, by contrast, still reflect the frozen build-time snapshot.
Generic views
A generic view (no fast-path source) still indexes and answers the non-materializing queries, but it must be given an explicit primary key:
SELECT count(*) FROM u_docs_idx WHERE body @@ ts_phrase('quick'); count------- 1Selecting its real columns is not supported and raises an error:
SELECT id, body FROM u_docs_idx WHERE body @@ ts_phrase('quick');db error: ERROR: materialising real columns from this view-backed inverted index is not yet supported -- view body must be a simple `SELECT * FROM <reader>(literal_args)` over a recognised fast-path source (read_parquet/csv/json/...)INCLUDE columns
INCLUDEd columns on a view are stored in the index's columnstore, so they are returned without materializing the source — the same as on a base table. Use INCLUDE for columns you frequently return but never search, to avoid the per-row source lookup.
Snapshot and isolation
A view-backed index is a static snapshot: its postings are captured at CREATE INDEX time and are not refreshed when the source changes (there is no background DML tracking as there is for base tables). To pick up new data, rebuild the index. A reader transaction keeps a consistent view of the index even if the underlying view is dropped concurrently.
See also
- Indexing External Data — Parquet/CSV/JSON on disk or S3
- Inverted Index — creating an index · Full-Text Search
- CREATE INDEX … USING inverted