Skip to main content

Metadata

Parquet Metadata

The parquet_metadata function can be used to query the metadata contained within a Parquet file, which reveals various internal details of the Parquet file such as the statistics of the different columns. This can be useful for figuring out what kind of skipping is possible in Parquet files, or even to obtain a quick overview of what the different columns contain. The function supports glob patterns to query metadata across multiple files in parallel:

Query
SELECT *FROM parquet_metadata('test.parquet');
Result
 file_name    | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type       | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types--------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- test.parquet |            0 |                  2 |                     3 | (varies)        |         0 | (varies)    |          2 | id             | INT32      | 1         | 2         |                0 | NULL                 | 1               | 2               | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL test.parquet |            0 |                  2 |                     3 | (varies)        |         1 | (varies)    |          2 | name           | BYTE_ARRAY | alpha     | beta      |                0 | NULL                 | alpha           | beta            | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL test.parquet |            0 |                  2 |                     3 | (varies)        |         2 | (varies)    |          2 | value          | INT32      | 10        | 20        |                0 | NULL                 | 10              | 20              | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL
Query
SELECT *FROM parquet_metadata('data/*.parquet');
Result
 file_name           | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type       | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types---------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- data/file_1.parquet |            0 |                  2 |                     3 | (varies)        |         0 | (varies)    |          2 | id             | INT32      | 1         | 2         |                0 | NULL                 | 1               | 2               | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL data/file_1.parquet |            0 |                  2 |                     3 | (varies)        |         1 | (varies)    |          2 | name           | BYTE_ARRAY | alpha     | beta      |                0 | NULL                 | alpha           | beta            | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL data/file_1.parquet |            0 |                  2 |                     3 | (varies)        |         2 | (varies)    |          2 | value          | INT32      | 10        | 20        |                0 | NULL                 | 10              | 20              | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL data/file_2.parquet |            0 |                  2 |                     3 | (varies)        |         0 | (varies)    |          2 | id             | INT32      | 1         | 2         |                0 | NULL                 | 1               | 2               | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL data/file_2.parquet |            0 |                  2 |                     3 | (varies)        |         1 | (varies)    |          2 | name           | BYTE_ARRAY | alpha     | beta      |                0 | NULL                 | alpha           | beta            | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL data/file_2.parquet |            0 |                  2 |                     3 | (varies)        |         2 | (varies)    |          2 | value          | INT32      | 10        | 20        |                0 | NULL                 | 10              | 20              | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL

The columns returned by parquet_metadata are:

Query
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_metadata('test.parquet'));
Result
 column_name                | column_type----------------------------+---------------------------------------------------------------------------------------------------------------- file_name                  | VARCHAR row_group_id               | BIGINT row_group_num_rows         | BIGINT row_group_num_columns      | BIGINT row_group_bytes            | BIGINT column_id                  | BIGINT file_offset                | BIGINT num_values                 | BIGINT path_in_schema             | VARCHAR type                       | VARCHAR stats_min                  | VARCHAR stats_max                  | VARCHAR stats_null_count           | BIGINT stats_distinct_count       | BIGINT stats_min_value            | VARCHAR stats_max_value            | VARCHAR compression                | VARCHAR encodings                  | VARCHAR index_page_offset          | BIGINT dictionary_page_offset     | BIGINT data_page_offset           | BIGINT total_compressed_size      | BIGINT total_uncompressed_size    | BIGINT key_value_metadata         | MAP(BLOB, BLOB) bloom_filter_offset        | BIGINT bloom_filter_length        | BIGINT min_is_exact               | BOOLEAN max_is_exact               | BOOLEAN row_group_compressed_bytes | BIGINT geo_bbox                   | STRUCT(xmin DOUBLE, xmax DOUBLE, ymin DOUBLE, ymax DOUBLE, zmin DOUBLE, zmax DOUBLE, mmin DOUBLE, mmax DOUBLE) geo_types                  | VARCHAR[]

Parquet Schema

The parquet_schema function can be used to query the internal schema contained within a Parquet file. Note that this is the schema as it is contained within the metadata of the Parquet file. If you want to figure out the column names and types contained within a Parquet file it is easier to use DESCRIBE.

Fetch the column names and column types:

Query
DESCRIBE SELECT * FROM 'test.parquet';
Result
 column_name | column_type | null | key  | default | extra-------------+-------------+------+------+---------+------- id          | INTEGER     | YES  | NULL | NULL    | NULL name        | VARCHAR     | YES  | NULL | NULL    | NULL value       | INTEGER     | YES  | NULL | NULL    | NULL

Fetch the internal schema of a Parquet file:

Query
SELECT *FROM parquet_schema('test.parquet');
Result
 file_name    | name          | type       | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type | duckdb_type | column_id--------------+---------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------+-------------+----------- test.parquet | duckdb_schema | NULL       | NULL        | REQUIRED        |            3 | NULL           | NULL  | NULL      | NULL     | NULL         | NULL        |         0 test.parquet | id            | INT32      | NULL        | OPTIONAL        |         NULL | INT_32         | NULL  | NULL      | NULL     | NULL         | INTEGER     |         1 test.parquet | name          | BYTE_ARRAY | NULL        | OPTIONAL        |         NULL | UTF8           | NULL  | NULL      | NULL     | NULL         | VARCHAR     |         2 test.parquet | value         | INT32      | NULL        | OPTIONAL        |         NULL | INT_32         | NULL  | NULL      | NULL     | NULL         | INTEGER     |         3

The columns returned by parquet_schema are:

Query
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_schema('test.parquet'));
Result
 column_name     | column_type-----------------+------------- file_name       | VARCHAR name            | VARCHAR type            | VARCHAR type_length     | VARCHAR repetition_type | VARCHAR num_children    | BIGINT converted_type  | VARCHAR scale           | BIGINT precision       | BIGINT field_id        | BIGINT logical_type    | VARCHAR duckdb_type     | VARCHAR column_id       | BIGINT

Parquet File Metadata

The parquet_file_metadata function can be used to query file-level metadata such as the format version and the encryption algorithm used:

Query
SELECT *FROM parquet_file_metadata('test.parquet');
Result
 file_name    | created_by | num_rows | num_row_groups | format_version | encryption_algorithm | footer_signing_key_metadata | file_size_bytes | footer_size | column_orders--------------+------------+----------+----------------+----------------+----------------------+-----------------------------+-----------------+-------------+------------------------------------------------------------------------------------------------------------------------------------ test.parquet | (varies)   |        2 |              1 |              1 | NULL                 | NULL                        | (varies)        | (varies)    | {ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder())}

The columns returned by parquet_file_metadata are:

Query
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_file_metadata('test.parquet'));
Result
 column_name                 | column_type-----------------------------+------------- file_name                   | VARCHAR created_by                  | VARCHAR num_rows                    | BIGINT num_row_groups              | BIGINT format_version              | BIGINT encryption_algorithm        | VARCHAR footer_signing_key_metadata | VARCHAR file_size_bytes             | UBIGINT footer_size                 | UBIGINT column_orders               | VARCHAR[]

Parquet Key-Value Metadata

The parquet_kv_metadata function can be used to query custom metadata defined as key-value pairs. Its key and value columns are returned as BLOB, so they display in hexadecimal:

Query
SELECT *FROM parquet_kv_metadata('test.parquet');
Result
 file_name    | key               | value--------------+-------------------+------------------------------------------------------------------------------------- test.parquet | \\x69735f6576656e | \\x6e6f7420276f646427 test.parquet | \\x6e756d626572   | \\x416e7377657220746f206c6966652c20756e6976657273652c20616e642065766572797468696e67

The columns returned by parquet_kv_metadata are:

Query
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_kv_metadata('test.parquet'));
Result
 column_name | column_type-------------+------------- file_name   | VARCHAR key         | BLOB value       | BLOB

Full Metadata

The parquet_full_metadata function returns all metadata for a Parquet file in a single row, combining the results of parquet_file_metadata, parquet_metadata, parquet_schema and parquet_kv_metadata as nested struct arrays. The length of each array reflects the file's structure — here three column chunks and four schema elements:

Query
SELECT    len(parquet_metadata)    AS column_chunks,    len(parquet_schema)      AS schema_elements,    len(parquet_kv_metadata) AS kv_pairsFROM parquet_full_metadata('test.parquet');
Result
 column_chunks | schema_elements | kv_pairs---------------+-----------------+----------             3 |               4 |        0

Each struct array contains the same columns as the corresponding standalone function; run SELECT * FROM parquet_full_metadata('test.parquet') to retrieve them all.

Bloom Filters

SereneDB supports Bloom filters for pruning the row groups that need to be read to answer highly selective queries. Currently, Bloom filters are supported for the following types:

  • Integer types: TINYINT, UTINYINT, SMALLINT, USMALLINT, INTEGER, UINTEGER, BIGINT, UBIGINT
  • Floating point types: FLOAT, DOUBLE
  • VARCHAR
  • BLOB

The parquet_bloom_probe(filename, column_name, value) function shows which row groups can be excluded when filtering for a given value of a given column using the Bloom filter. For example:

Query
FROM parquet_bloom_probe('my_file.parquet', 'name', 'gamma');
Result
 file_name       | row_group_id | bloom_filter_excludes-----------------+--------------+----------------------- my_file.parquet |            0 | f