Indexing JSON
Semi-structured JSON documents are indexed and searched the same way as relational columns — by putting their sub-fields into an inverted index. The only real decision is the storage type: keep the data as raw JSON text, or convert it to the shredded VARIANT type.
Converting JSON to VARIANT
Parse JSON text into a shredded VARIANT with a two-step cast — ::JSON parses, ::VARIANT shreds:
'{"host": "web-01", "msg": "disk error"}'::JSON::VARIANT
Cast through JSON deliberately: a bare '…'::VARIANT on a string stores the text as a single string scalar, not a structured object, so sub-fields cannot be addressed. Extract a sub-field by name with the ['field'] subscript and cast it to the type you need (['msg']::VARCHAR).
Indexing a VARIANT sub-field
You do not index a VARIANT column as a whole — index each sub-field extraction expression you want to search, cast to a concrete type. The query repeats the same extraction, and any other sub-field still comes back without re-parsing because each is stored as its own column:
The cast is required, and it carries meaning — it is not boilerplate. A subscript like doc['msg'] is itself a VARIANT (the type travels with the value), and a VARIANT cannot be an index key. Casting to VARCHAR, INTEGER, DATE, … both makes the value indexable and chooses how it is indexed — text is analyzed for full-text search, numbers and dates get range matching. The same cast must appear in the query so the planner matches it to the index.
CREATE INDEX logs_idx ON logs USING inverted ((doc['msg']::VARCHAR) logmsg);
SELECT id, doc['host']::VARCHAR AS hostFROM logs_idxWHERE (doc['msg']::VARCHAR) @@ 'error'ORDER BY id; id | host----+-------- 1 | web-01How each sub-field type is indexed
The cast type of the extraction decides how the sub-field behaves in the index — exactly as if that type were a top-level column. One index can mix all of them:
| Sub-field | Extraction expression | Behaviour |
|---|---|---|
| Text | (doc['title']::VARCHAR) dict | Analyzed full-text search — @@ 'term', phrases, fuzzy |
| Number | (doc['priority']::INTEGER) | Range matching — ts_between, ts_ge, … |
| Date / timestamp | (doc['ts']::DATE) | Range matching on the temporal value |
| Boolean / identifier | (doc['active']::BOOLEAN), (doc['sku']::VARCHAR) | Verbatim (exact) match when no dictionary is attached |
| Array | (doc['tags']::VARCHAR[]) dict | Any-element match — a row matches if any element matches (see Indexing arrays) |
| Nested | (doc['nested']['region']::VARCHAR) | Address any depth by chaining subscripts |
A sub-field that is absent from a row yields NULL and is simply not indexed for that row — documents need not share a uniform shape.
Nested fields
Reach into a nested object by chaining subscripts — doc['attrs']['brand'] — and cast the leaf to its type. Any depth works, and the nested extraction is indexed exactly like a top-level one. Here attrs.brand is indexed and searched as analyzed text:
CREATE INDEX catalog_idx ON catalog USING inverted ( id, (doc['attrs']['brand']::VARCHAR) cat_dict, (doc['tags']::VARCHAR[]) cat_dict );
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['attrs']['brand']::VARCHAR) @@ 'acme'ORDER BY id; id | name----+-------- 1 | Laptop 3 | CableArray fields
Cast an array sub-field to a typed array (::VARCHAR[]) and it is indexed element by element: a row matches if any element matches. Combine with ts_all to require every value, or ts_any for "at least N of". Using the same catalog_idx from above:
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ 'sale'ORDER BY id;
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ ts_all(['sale', 'new'])ORDER BY id; id | name----+-------- 1 | Laptop 3 | Cable
id | name----+-------- 1 | LaptopAssigning a tokenizer per sub-field
Each extraction takes its own dictionary in the opclass position, so different sub-fields can be analyzed differently in the same index — a stemming dictionary on a description, lower-casing on a title, and a verbatim (no-dictionary) identifier — while numeric and temporal sub-fields get range matching automatically:
CREATE INDEX tickets_idx ON tickets USING inverted ( id, (doc['title']::VARCHAR) jlow, (doc['priority']::INTEGER), (doc['tags']::VARCHAR[]) jlow );
SELECT id, doc['title']::VARCHAR AS titleFROM tickets_idxWHERE (doc['title']::VARCHAR) @@ 'error' AND (doc['priority']::INTEGER) @@ ts_ge(3) AND (doc['tags']::VARCHAR[]) @@ 'urgent'ORDER BY id; id | title----+------------ 1 | Disk ErrorReturning a payload without searching it
To keep a VARIANT only for retrieval, INCLUDE the column so it is returned straight from the index's columnstore alongside a text column you do search — no extraction needed:
CREATE INDEX documents_idx ON documents USING inverted (id, body en) INCLUDE (payload);
SELECT id, payload FROM documents_idx WHERE body @@ 'cat' ORDER BY id; id | payload----+-------------- 1 | {"views":42}Indexing raw JSON text
When the column must stay JSON, index a JSON extraction expression such as doc ->> 'host' with a dictionary, and repeat the same extraction in the query. This works, but every match re-parses the JSON text and pays the whitespace/round-trip pitfalls that VARIANT avoids:
CREATE INDEX events_idx ON events USING inverted (id, (doc ->> 'host') kw);
SELECT id, doc ->> 'host' AS host FROM events_idx WHERE doc ->> 'host' @@ 'web-02' ORDER BY id; id | host----+-------- 2 | web-02Indexing a JSON column as VARIANT
You do not have to migrate the column to VARIANT to get shredded, typed sub-field indexing — cast to VARIANT inside the index expression. The column stays JSON (a JSON-typed column casts straight with ::VARIANT; a VARCHAR holding JSON needs ::JSON::VARIANT first), while the index extracts typed sub-fields just as a real VARIANT column would:
CREATE INDEX events_json_idx ON events_json USING inverted ( id, ((doc::VARIANT)['title']::VARCHAR) jtext, ((doc::VARIANT)['priority']::INTEGER) );
SELECT id, (doc::VARIANT)['title']::VARCHAR AS titleFROM events_json_idxWHERE ((doc::VARIANT)['title']::VARCHAR) @@ 'error' AND ((doc::VARIANT)['priority']::INTEGER) @@ ts_ge(3)ORDER BY id; id | title----+------------ 1 | Disk ErrorThis is the best of both worlds when you cannot change the schema: the stored bytes remain JSON, but each indexed sub-field is shredded and typed. The cast happens once per row at index-build time, not on every search. As always, the query must repeat the exact indexed expression — including the ::VARIANT cast.
See also
- What to Index — columns, expressions, arrays and sizing
- Text Analysis — dictionaries and tokenizers
- Full-Text Search ·
VARIANTtype · JSON overview