Skip to main content

Indexing External Data

Search your data lake in place — no ETL. An inverted index can be built directly over external files — Parquet, CSV, JSON, ORC and Iceberg, on local disk or S3 — so you get full-text, vector and geospatial search over data that never gets copied into the database. You point a view at a reader function and index the view; the rows stay in the files, and only the search postings are built.

This is zero-ETL search: no pipeline to move data into a search engine, no second copy to keep in sync, no separate cluster. The files remain the source of truth; the index is a searchable view over them. It is ideal for log and event lakes, document and embedding archives, and any large Parquet/Iceberg dataset you want to search without an ingest step.

Supported sources

SourceReaderNotes
Parquetread_parquet / parquet_scanSingle file or glob; file_row_number PK
CSVread_csv / read_csv_autoByte-offset PK; full reader-option support
JSON / NDJSONread_json / read_ndjson (*_auto)Byte-offset PK
Icebergiceberg_scan(...) / catalog tableSnapshot at build time
Attached DuckDBATTACH … ; SELECT … FROM db.schema.tRead through the live attachment
Text / blobsread_text / read_blobOne document per file (or per glob entry)

Local paths and s3:// URLs (via httpfs) both work, as do globs ('…/*.parquet') spanning thousands of files across a partitioned dataset.

How it works

Define a view over the reader, then create the index on the view. The reader's row identity (file_row_number, byte offset or a (file_index, position) pair for globs) is detected automatically — no WITH (pk = ...) needed:

Query
CREATE VIEW logs AS    SELECT * FROM read_parquet('s3://my-bucket/logs/2025/*.parquet');

Attach a text search dictionary to the text columns and build the index. Columns without a dictionary are indexed verbatim:

Query
CREATE TEXT SEARCH DICTIONARY english_dict (    template = 'text',    locale = 'en_US.UTF-8',    case = 'lower',    stemming = true,    accent = false,    frequency = true,    position = true,    offset = true);
Query
CREATE INDEX logs_idx ON logs    USING inverted (id, level, service, message english_dict);

Then query the index by name, exactly like any other inverted index:

Query
-- Full-text search ranked by relevance, combined with an exact filterSELECT id, level, messageFROM logs_idxWHERE message @@ ts_phrase('out of memory') AND level @@ 'ERROR'ORDER BY BM25(logs_idx.tableoid) DESC, id;
Result
 id | level | message----+-------+------------------------------  3 | ERROR | out of memory while indexing

Only the columns the index needs are read from the files at build time — projection and predicate pruning are pushed into the reader, so indexing a wide Parquet dataset touches just the indexed columns. See the Indexing External Data cookbook recipe for an end-to-end walkthrough, and Indexing Views for the underlying fast-path, row-identity and materialization rules.

Reader parameters

Reader options are preserved: the same parameters used to build the index are replayed when columns are materialized. CSV options (delim, header, quote, nullstr, skip, compression, columns, types, dateformat, …), JSON options (format, records, columns, maximum_object_size, …) and Parquet options (binary_as_string, hive_partitioning, file_row_number, …) all round-trip.

Freshness

An external-data index is a static snapshot of the postings at CREATE INDEX time; it does not track changes to the files. When the underlying data changes, rebuild the index:

Query
DROP INDEX logs_idx;
CREATE INDEX logs_idx ON logs    USING inverted (id, level, service, message english_dict);

Materialized column values are read live from the current files, so counts and scores reflect the build-time snapshot while a materialized column reflects the file as it is now (a row removed from the source materializes as NULL).

See also