Edit this page

Indexing External Data

Search your data lake in place — no ETL. An inverted index can be built directly over external files — Parquet, CSV, JSON, ORC and Iceberg, on local disk or S3 — so you get full-text, vector and geospatial search over data that never gets copied into the database. You point a view at a reader function and index the view; the rows stay in the files, and only the search postings are built.

This is zero-ETL search: no pipeline to move data into a search engine, no second copy to keep in sync, no separate cluster. The files remain the source of truth; the index is a searchable view over them. It is ideal for log and event lakes, document and embedding archives, and any large Parquet/Iceberg dataset you want to search without an ingest step.

Supported sources

Source	Reader	Notes
Parquet	`read_parquet` / `parquet_scan`	Single file or glob; `file_row_number` PK
CSV	`read_csv` / `read_csv_auto`	Byte-offset PK; full reader-option support
JSON / NDJSON	`read_json` / `read_ndjson` (`*_auto`)	Byte-offset PK
Iceberg	`iceberg_scan(...)` / catalog table	Snapshot at build time
Attached DuckDB	`ATTACH … ; SELECT … FROM db.schema.t`	Read through the live attachment
Text / blobs	`read_text` / `read_blob`	One document per file (or per glob entry)

Local paths and s3:// URLs (via httpfs) both work, as do globs ('…/*.parquet') spanning thousands of files across a partitioned dataset.

How it works

Define a view over the reader, then create the index on the view. The reader's row identity (file_row_number, byte offset or a (file_index, position) pair for globs) is detected automatically — no WITH (pk = ...) needed:

Query

CREATE VIEW logs AS    SELECT * FROM read_parquet('s3://my-bucket/logs/2025/*.parquet');

Attach a text search dictionary to the text columns and build the index. Columns without a dictionary are indexed verbatim:

Query

CREATE TEXT SEARCH DICTIONARY english_dict (    template = 'text',    locale = 'en_US.UTF-8',    case = 'lower',    stemming = true,    accent = false,    frequency = true,    position = true,    offset = true);

Query

CREATE INDEX logs_idx ON logs    USING inverted (id, level, service, message english_dict);

Then query the index by name, exactly like any other inverted index:

Query

-- Full-text search ranked by relevance, combined with an exact filterSELECT id, level, messageFROM logs_idxWHERE message @@ ts_phrase('out of memory') AND level @@ 'ERROR'ORDER BY BM25(logs_idx.tableoid) DESC, id;

Result

 id | level | message----+-------+------------------------------  3 | ERROR | out of memory while indexing

Only the columns the index needs are read from the files at build time — projection and predicate pruning are pushed into the reader, so indexing a wide Parquet dataset touches just the indexed columns. See the Indexing External Data cookbook recipe for an end-to-end walkthrough, and Indexing Views for the underlying fast-path, row-identity and materialization rules.

Reader parameters

Reader options are preserved: the same parameters used to build the index are replayed when columns are materialized. CSV options (delim, header, quote, nullstr, skip, compression, columns, types, dateformat, …), JSON options (format, records, columns, maximum_object_size, …) and Parquet options (binary_as_string, hive_partitioning, file_row_number, …) all round-trip.

Freshness

An external-data index is a static snapshot of the postings at CREATE INDEX time; it does not track changes to the files. When the underlying data changes, rebuild the index:

Query

DROP INDEX logs_idx;
CREATE INDEX logs_idx ON logs    USING inverted (id, level, service, message english_dict);

Materialized column values are read live from the current files, so counts and scores reflect the build-time snapshot while a materialized column reflects the file as it is now (a row removed from the source materializes as NULL).

Supported sources​

How it works​

Reader parameters​

Freshness​

See also​

Supported sources

How it works

Reader parameters

Freshness

See also