Edit this page

Indexing JSON

Semi-structured JSON documents are indexed and searched the same way as relational columns — by putting their sub-fields into an inverted index. The only real decision is the storage type: keep the data as raw JSON text, or convert it to the shredded VARIANT type.

Converting JSON to `VARIANT`

Parse JSON text into a shredded VARIANT with a two-step cast — ::JSON parses, ::VARIANT shreds:

'{"host": "web-01", "msg": "disk error"}'::JSON::VARIANT

Cast through JSON deliberately: a bare '…'::VARIANT on a string stores the text as a single string scalar, not a structured object, so sub-fields cannot be addressed. Extract a sub-field by name with the ['field'] subscript and cast it to the type you need (['msg']::VARCHAR).

Indexing a `VARIANT` sub-field

You do not index a VARIANT column as a whole — index each sub-field extraction expression you want to search, cast to a concrete type. The query repeats the same extraction, and any other sub-field still comes back without re-parsing because each is stored as its own column:

The cast is required, and it carries meaning — it is not boilerplate. A subscript like doc['msg'] is itself a VARIANT (the type travels with the value), and a VARIANT cannot be an index key. Casting to VARCHAR, INTEGER, DATE, … both makes the value indexable and chooses how it is indexed — text is analyzed for full-text search, numbers and dates get range matching. The same cast must appear in the query so the planner matches it to the index.

Query

CREATE INDEX logs_idx ON logs    USING inverted ((doc['msg']::VARCHAR) logmsg);
SELECT id, doc['host']::VARCHAR AS hostFROM logs_idxWHERE (doc['msg']::VARCHAR) @@ 'error'ORDER BY id;

Result

 id | host----+--------  1 | web-01

How each sub-field type is indexed

The cast type of the extraction decides how the sub-field behaves in the index — exactly as if that type were a top-level column. One index can mix all of them:

Sub-field	Extraction expression	Behaviour
Text	`(doc['title']::VARCHAR) dict`	Analyzed full-text search — `@@ 'term'`, phrases, fuzzy
Number	`(doc['priority']::INTEGER)`	Range matching — `ts_between`, `ts_ge`, …
Date / timestamp	`(doc['ts']::DATE)`	Range matching on the temporal value
Boolean / identifier	`(doc['active']::BOOLEAN)`, `(doc['sku']::VARCHAR)`	Verbatim (exact) match when no dictionary is attached
Array	`(doc['tags']::VARCHAR[]) dict`	Any-element match — a row matches if any element matches (see Indexing arrays)
Nested	`(doc['nested']['region']::VARCHAR)`	Address any depth by chaining subscripts

A sub-field that is absent from a row yields NULL and is simply not indexed for that row — documents need not share a uniform shape.

Nested fields

Reach into a nested object by chaining subscripts — doc['attrs']['brand'] — and cast the leaf to its type. Any depth works, and the nested extraction is indexed exactly like a top-level one. Here attrs.brand is indexed and searched as analyzed text:

Query

CREATE INDEX catalog_idx ON catalog    USING inverted (        id,        (doc['attrs']['brand']::VARCHAR) cat_dict,        (doc['tags']::VARCHAR[]) cat_dict    );
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['attrs']['brand']::VARCHAR) @@ 'acme'ORDER BY id;

Result

 id | name----+--------  1 | Laptop  3 | Cable

Array fields

Cast an array sub-field to a typed array (::VARCHAR[]) and it is indexed element by element: a row matches if any element matches. Combine with ts_all to require every value, or ts_any for "at least N of". Using the same catalog_idx from above:

Query

SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ 'sale'ORDER BY id;
SELECT id, doc['name']::VARCHAR AS nameFROM catalog_idxWHERE (doc['tags']::VARCHAR[]) @@ ts_all(['sale', 'new'])ORDER BY id;

Result

 id | name----+--------  1 | Laptop  3 | Cable
 id | name----+--------  1 | Laptop

Assigning a tokenizer per sub-field

Each extraction takes its own dictionary in the opclass position, so different sub-fields can be analyzed differently in the same index — a stemming dictionary on a description, lower-casing on a title, and a verbatim (no-dictionary) identifier — while numeric and temporal sub-fields get range matching automatically:

Query

CREATE INDEX tickets_idx ON tickets    USING inverted (        id,        (doc['title']::VARCHAR) jlow,        (doc['priority']::INTEGER),        (doc['tags']::VARCHAR[]) jlow    );
SELECT id, doc['title']::VARCHAR AS titleFROM tickets_idxWHERE (doc['title']::VARCHAR) @@ 'error'  AND (doc['priority']::INTEGER) @@ ts_ge(3)  AND (doc['tags']::VARCHAR[]) @@ 'urgent'ORDER BY id;

Result

 id | title----+------------  1 | Disk Error

Returning a payload without searching it

To keep a VARIANT only for retrieval, INCLUDE the column so it is returned straight from the index's columnstore alongside a text column you do search — no extraction needed:

Query

CREATE INDEX documents_idx ON documents    USING inverted (id, body en) INCLUDE (payload);
SELECT id, payload FROM documents_idx WHERE body @@ 'cat' ORDER BY id;

Result

 id | payload----+--------------  1 | {"views":42}

Indexing raw JSON text

When the column must stay JSON, index a JSON extraction expression such as doc ->> 'host' with a dictionary, and repeat the same extraction in the query. This works, but every match re-parses the JSON text and pays the whitespace/round-trip pitfalls that VARIANT avoids:

Query

CREATE INDEX events_idx ON events    USING inverted (id, (doc ->> 'host') kw);
SELECT id, doc ->> 'host' AS host FROM events_idx WHERE doc ->> 'host' @@ 'web-02' ORDER BY id;

Result

 id | host----+--------  2 | web-02

Indexing a `JSON` column as `VARIANT`

You do not have to migrate the column to VARIANT to get shredded, typed sub-field indexing — cast to VARIANT inside the index expression. The column stays JSON (a JSON-typed column casts straight with ::VARIANT; a VARCHAR holding JSON needs ::JSON::VARIANT first), while the index extracts typed sub-fields just as a real VARIANT column would:

Query

CREATE INDEX events_json_idx ON events_json    USING inverted (        id,        ((doc::VARIANT)['title']::VARCHAR) jtext,        ((doc::VARIANT)['priority']::INTEGER)    );
SELECT id, (doc::VARIANT)['title']::VARCHAR AS titleFROM events_json_idxWHERE ((doc::VARIANT)['title']::VARCHAR) @@ 'error'  AND ((doc::VARIANT)['priority']::INTEGER) @@ ts_ge(3)ORDER BY id;

Result

 id | title----+------------  1 | Disk Error

This is the best of both worlds when you cannot change the schema: the stored bytes remain JSON, but each indexed sub-field is shredded and typed. The cast happens once per row at index-build time, not on every search. As always, the query must repeat the exact indexed expression — including the ::VARIANT cast.

Converting JSON to VARIANT​

Indexing a VARIANT sub-field​

How each sub-field type is indexed​

Nested fields​

Array fields​

Assigning a tokenizer per sub-field​

Returning a payload without searching it​

Indexing raw JSON text​

Indexing a JSON column as VARIANT​

See also​

Converting JSON to `VARIANT`

Indexing a `VARIANT` sub-field

How each sub-field type is indexed

Nested fields

Array fields

Assigning a tokenizer per sub-field

Returning a payload without searching it

Indexing raw JSON text

Indexing a `JSON` column as `VARIANT`

See also