Indexing External Data

SereneDB can create inverted indexes over external tables — Parquet, CSV or ORC files on local disk or S3 — enabling full-text search, phrase queries, fuzzy matching and relevance ranking without importing data into the database.

Create an external table

Define a table backed by files on S3:

CREATE TABLE logs (
    id INTEGER,
    timestamp TIMESTAMP,
    level TEXT,
    service TEXT,
    message TEXT
) USING EXTERNAL WITH (
    PATH = 's3://my-bucket/logs/2025/*.parquet',
    FORMAT = 'Parquet',
    S3_ACCESS_KEY = 'your-access-key',
    S3_SECRET_KEY = 'your-secret-key',
    S3_REGION = 'us-east-1'
);

Local files work the same way:

CREATE TABLE articles (
    id INTEGER,
    title TEXT,
    body TEXT,
    category TEXT
) USING EXTERNAL WITH (
    PATH = '/data/articles.parquet',
    FORMAT = 'Parquet'
);

Create a text search dictionary

Define how text columns should be tokenized:

CREATE TEXT SEARCH DICTIONARY english_dict (
    TEMPLATE = 'text',
    LOCALE = 'en_US.UTF-8',
    CASE = 'lower',
    STEMMING = true,
    ACCENT = false,
    FREQUENCY = true,
    POSITION = true
);
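The CASE and ACCENT options control how tokens are normalized before they enter the index. As a generic illustration of case folding and accent folding — not SereneDB's internal tokenizer, and assuming ACCENT = false means accent marks are stripped — a minimal sketch:

```python
import unicodedata

def normalize_token(token, case_lower=True, keep_accents=False):
    """Apply case folding and optional accent stripping to one token."""
    if case_lower:
        token = token.lower()
    if not keep_accents:
        # Decompose characters, then drop the combining marks (the accents).
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(c for c in decomposed if not unicodedata.combining(c))
    return token

print(normalize_token("Café"))  # -> "cafe"
```

Queries normalized the same way will then match regardless of case or accents.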

Build an inverted index

Create the index over the external table, then run VACUUM (UPDATE_INDEXES) to perform the initial build. Columns listed without a dictionary are indexed as-is (exact matching); columns followed by a dictionary name get full-text processing:

CREATE INDEX logs_search ON logs
USING inverted (id, timestamp, level, service, message english_dict);

VACUUM (UPDATE_INDEXES) logs;

Query the index

Once built, query the index by name — just like any other inverted index:

-- Full-text search
SELECT id, timestamp, message
FROM logs_search
WHERE PHRASE(message, 'connection timeout')
ORDER BY BM25() DESC
LIMIT 20;
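The BM25() ranking used above is the standard BM25 relevance formula: term frequency weighted by inverse document frequency, normalized by document length. A self-contained sketch of that formula for intuition — an illustration, not SereneDB's implementation:

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against query_terms with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "connection timeout while contacting upstream".split(),
    "request completed successfully".split(),
    "retrying after connection timeout".split(),
]
scores = bm25_scores(["connection", "timeout"], docs)
```

A document containing no query term scores zero, and between two documents with the same term counts, the shorter one ranks higher.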

-- Combine search with exact filters
SELECT id, timestamp, level, message
FROM logs_search
WHERE PHRASE(message, 'out of memory') AND level = 'ERROR'
ORDER BY timestamp DESC;

-- Fuzzy matching for typo tolerance
SELECT id, service, message
FROM logs_search
WHERE LEVENSHTEIN_MATCH(service, 'postgre', 1)
ORDER BY id;
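LEVENSHTEIN_MATCH(column, term, n) matches values within edit distance n of the term, where Levenshtein distance counts single-character insertions, deletions, and substitutions. A standard dynamic-programming implementation, for reference:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("postgre", "postgres"))  # -> 1
```

With a threshold of 1, as in the query above, 'postgres' matches 'postgre' (one insertion) but 'postgresql' (distance 3) does not.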

Refresh the index

When the underlying files change, update the index:

VACUUM (UPDATE_INDEXES) logs;

Use cases

  • Log analysis — search through terabytes of Parquet log files on S3 without loading them into the database.
  • Data lake search — add full-text search to your existing data lake files.
  • Archival queries — index historical data that rarely changes but needs to be searchable.
  • Hybrid pipelines — combine external data with local tables in the same query.

See also