Indexing External Data
SereneDB can create inverted indexes over external files — Parquet, CSV or ORC files on local disk or S3 — enabling full-text search, phrase queries, fuzzy matching and relevance ranking without importing data into the database. You expose the files through a view over a reader function, then index the view.
Expose the files through a view
Define a view backed by a remote file:
CREATE VIEW logs AS SELECT * FROM read_parquet('s3://my-bucket/logs/2025/*.parquet');Local files work the same way:
CREATE VIEW logs_csv AS SELECT * FROM read_csv('/data/logs/*.csv');Create a text search dictionary
Define how text columns should be tokenized:
CREATE TEXT SEARCH DICTIONARY english_dict ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = true, accent = false, frequency = true, position = true, offset = true);Build an inverted index
Create the index over the view. Columns without a dictionary are indexed as-is (exact matching); columns with a dictionary get full-text processing:
CREATE INDEX logs_idx ON logs USING inverted (id, level, service, message english_dict);Query the index
Once built, query the index by name — just like any other inverted index:
-- Full-text search ranked by relevance, combined with an exact filterSELECT id, level, messageFROM logs_idxWHERE message @@ ts_phrase('out of memory') AND level @@ 'ERROR'ORDER BY BM25(logs_idx.tableoid) DESC, id; id | level | message----+-------+------------------------------ 3 | ERROR | out of memory while indexingRebuild the index
An external-data index is a static snapshot of the files at build time. When the underlying files change, rebuild the index:
DROP INDEX logs_idx;
CREATE INDEX logs_idx ON logs USING inverted (id, level, service, message english_dict);Use cases
- Log analysis — search through terabytes of Parquet log files on S3 without loading them into the database.
- Data lake search — add full-text search to your existing data lake files.
- Archival queries — index historical data that rarely changes but needs to be searchable.
- Hybrid pipelines — combine external data with local tables in the same query.