Indexing External Data
Search your data lake in place — no ETL. An inverted index can be built directly over external files — Parquet, CSV, JSON, ORC and Iceberg, on local disk or S3 — so you get full-text, vector and geospatial search over data that never gets copied into the database. You point a view at a reader function and index the view; the rows stay in the files, and only the search postings are built.
This is zero-ETL search: no pipeline to move data into a search engine, no second copy to keep in sync, no separate cluster. The files remain the source of truth; the index is a searchable view over them. It is ideal for log and event lakes, document and embedding archives, and any large Parquet/Iceberg dataset you want to search without an ingest step.
Supported sources
| Source | Reader | Notes |
|---|---|---|
| Parquet | read_parquet / parquet_scan | Single file or glob; file_row_number PK |
| CSV | read_csv / read_csv_auto | Byte-offset PK; full reader-option support |
| JSON / NDJSON | read_json / read_ndjson (*_auto) | Byte-offset PK |
| Iceberg | iceberg_scan(...) / catalog table | Snapshot at build time |
| Attached DuckDB | ATTACH … ; SELECT … FROM db.schema.t | Read through the live attachment |
| Text / blobs | read_text / read_blob | One document per file (or per glob entry) |
Local paths and s3:// URLs (via httpfs) both work, as do globs ('…/*.parquet') spanning thousands of files across a partitioned dataset.
How it works
Define a view over the reader, then create the index on the view. The reader's row identity (file_row_number, byte offset or a (file_index, position) pair for globs) is detected automatically — no WITH (pk = ...) needed:
CREATE VIEW logs AS SELECT * FROM read_parquet('s3://my-bucket/logs/2025/*.parquet');Attach a text search dictionary to the text columns and build the index. Columns without a dictionary are indexed verbatim:
CREATE TEXT SEARCH DICTIONARY english_dict ( template = 'text', locale = 'en_US.UTF-8', case = 'lower', stemming = true, accent = false, frequency = true, position = true, offset = true);CREATE INDEX logs_idx ON logs USING inverted (id, level, service, message english_dict);Then query the index by name, exactly like any other inverted index:
-- Full-text search ranked by relevance, combined with an exact filterSELECT id, level, messageFROM logs_idxWHERE message @@ ts_phrase('out of memory') AND level @@ 'ERROR'ORDER BY BM25(logs_idx.tableoid) DESC, id; id | level | message----+-------+------------------------------ 3 | ERROR | out of memory while indexingOnly the columns the index needs are read from the files at build time — projection and predicate pruning are pushed into the reader, so indexing a wide Parquet dataset touches just the indexed columns. See the Indexing External Data cookbook recipe for an end-to-end walkthrough, and Indexing Views for the underlying fast-path, row-identity and materialization rules.
Reader parameters
Reader options are preserved: the same parameters used to build the index are replayed when columns are materialized. CSV options (delim, header, quote, nullstr, skip, compression, columns, types, dateformat, …), JSON options (format, records, columns, maximum_object_size, …) and Parquet options (binary_as_string, hive_partitioning, file_row_number, …) all round-trip.
Freshness
An external-data index is a static snapshot of the postings at CREATE INDEX time; it does not track changes to the files. When the underlying data changes, rebuild the index:
DROP INDEX logs_idx;
CREATE INDEX logs_idx ON logs USING inverted (id, level, service, message english_dict);Materialized column values are read live from the current files, so counts and scores reflect the build-time snapshot while a materialized column reflects the file as it is now (a row removed from the source materializes as NULL).
See also
- Indexing Views — row identity, fast paths, materialization
- Inverted Index · Full-Text Search · Vector Search · Geospatial Search
- Indexing External Data recipe