Skip to main content

Indexing JSON

Semi-structured JSON documents are indexed and searched the same way as relational columns — by putting their sub-fields into an inverted index. The only real decision is the storage type: keep the data as raw JSON text, or convert it to the shredded VARIANT type.

Converting JSON to VARIANT

Parse JSON text into a shredded VARIANT with a two-step cast — ::JSON parses, ::VARIANT shreds:

'{"host": "web-01", "msg": "disk error"}'::JSON::VARIANT

Cast through JSON deliberately: a bare '…'::VARIANT on a string stores the text as a single string scalar, not a structured object, so sub-fields cannot be addressed. Extract a sub-field by name with the ['field'] subscript and cast it to the type you need (['msg']::VARCHAR).

Indexing a VARIANT sub-field

You do not index a VARIANT column as a whole — index each sub-field extraction expression you want to search, cast to a concrete type. The query repeats the same extraction, and any other sub-field still comes back without re-parsing because each is stored as its own column:

The cast is required, and it carries meaning — it is not boilerplate. A subscript like doc['msg'] is itself a VARIANT (the type travels with the value), and a VARIANT cannot be an index key. Casting to VARCHAR, INTEGER, DATE, … both makes the value indexable and chooses how it is indexed — text is analyzed for full-text search, numbers and dates get range matching. The same cast must appear in the query so the planner matches it to the index.

Query
CREATE INDEX logs_idx ON logs    USING inverted ((doc['msg']::VARCHAR) logmsg);
SELECT id, doc['host']::VARCHAR AS hostFROM logs_idxWHERE (doc['msg']::VARCHAR) @@ 'error'ORDER BY id;
Result
 id | host----+--------  1 | web-01

How each sub-field type is indexed

The cast type of the extraction decides how the sub-field behaves in the index — exactly as if that type were a top-level column. One index can mix all of them:

Sub-fieldExtraction expressionBehaviour
Text(doc['title']::VARCHAR) dictAnalyzed full-text search@@ 'term', phrases, fuzzy
Number(doc['priority']::INTEGER)Range matchingts_between, ts_ge, …
Date / timestamp(doc['ts']::DATE)Range matching on the temporal value
Boolean / identifier(doc['active']::BOOLEAN), (doc['sku']::VARCHAR)Verbatim (exact) match when no dictionary is attached
Array(doc['tags']::VARCHAR[]) dictAny-element match — a row matches if any element matches (see Indexing arrays)
Nested(doc['nested']['region']::VARCHAR)Address any depth by chaining subscripts

A sub-field that is absent from a row yields NULL and is simply not indexed for that row — documents need not share a uniform shape.

Nested fields

Reach into a nested object by chaining subscripts — doc['attrs']['brand'] — and cast the leaf to its type. Any depth works, and the nested extraction is indexed exactly like a top-level one. Here attrs.brand is indexed and searched as analyzed text:

Query
CREATE INDEX catalog_idx ON catalog    USING inverted (        id,        (doc['attrs']['brand']::VARCHAR) cat_dict,        (doc['tags']::VARCHAR[]) cat_dict    );
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['attrs']['brand']::VARCHAR) @@ 'acme'ORDER BY id;
Result
 id | name----+--------  1 | Laptop  3 | Cable

Array fields

Cast an array sub-field to a typed array (::VARCHAR[]) and it is indexed element by element: a row matches if any element matches. Combine with ts_all to require every value, or ts_any for "at least N of". Using the same catalog_idx from above:

Query
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ 'sale'ORDER BY id;
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ ts_all(['sale', 'new'])ORDER BY id;
Result
 id | name----+--------  1 | Laptop  3 | Cable
 id | name----+--------  1 | Laptop

Assigning a tokenizer per sub-field

Each extraction takes its own dictionary in the opclass position, so different sub-fields can be analyzed differently in the same index — a stemming dictionary on a description, lower-casing on a title, and a verbatim (no-dictionary) identifier — while numeric and temporal sub-fields get range matching automatically:

Query
CREATE INDEX tickets_idx ON tickets    USING inverted (        id,        (doc['title']::VARCHAR) jlow,        (doc['priority']::INTEGER),        (doc['tags']::VARCHAR[]) jlow    );
SELECT id, doc['title']::VARCHAR AS titleFROM tickets_idxWHERE (doc['title']::VARCHAR) @@ 'error'  AND (doc['priority']::INTEGER) @@ ts_ge(3)  AND (doc['tags']::VARCHAR[]) @@ 'urgent'ORDER BY id;
Result
 id | title----+------------  1 | Disk Error

Returning a payload without searching it

To keep a VARIANT only for retrieval, INCLUDE the column so it is returned straight from the index's columnstore alongside a text column you do search — no extraction needed:

Query
CREATE INDEX documents_idx ON documents    USING inverted (id, body en) INCLUDE (payload);
SELECT id, payload FROM documents_idx WHERE body @@ 'cat' ORDER BY id;
Result
 id | payload----+--------------  1 | {"views":42}

Indexing raw JSON text

When the column must stay JSON, index a JSON extraction expression such as doc ->> 'host' with a dictionary, and repeat the same extraction in the query. This works, but every match re-parses the JSON text and pays the whitespace/round-trip pitfalls that VARIANT avoids:

Query
CREATE INDEX events_idx ON events    USING inverted (id, (doc ->> 'host') kw);
SELECT id, doc ->> 'host' AS host FROM events_idx WHERE doc ->> 'host' @@ 'web-02' ORDER BY id;
Result
 id | host----+--------  2 | web-02

Indexing a JSON column as VARIANT

You do not have to migrate the column to VARIANT to get shredded, typed sub-field indexing — cast to VARIANT inside the index expression. The column stays JSON (a JSON-typed column casts straight with ::VARIANT; a VARCHAR holding JSON needs ::JSON::VARIANT first), while the index extracts typed sub-fields just as a real VARIANT column would:

Query
CREATE INDEX events_json_idx ON events_json    USING inverted (        id,        ((doc::VARIANT)['title']::VARCHAR) jtext,        ((doc::VARIANT)['priority']::INTEGER)    );
SELECT id, (doc::VARIANT)['title']::VARCHAR AS titleFROM events_json_idxWHERE ((doc::VARIANT)['title']::VARCHAR) @@ 'error'  AND ((doc::VARIANT)['priority']::INTEGER) @@ ts_ge(3)ORDER BY id;
Result
 id | title----+------------  1 | Disk Error

This is the best of both worlds when you cannot change the schema: the stored bytes remain JSON, but each indexed sub-field is shredded and typed. The cast happens once per row at index-build time, not on every search. As always, the query must repeat the exact indexed expression — including the ::VARIANT cast.

See also