Metadata
Parquet Metadata
The parquet_metadata function can be used to query the metadata contained within a Parquet file, which reveals various internal details of the Parquet file such as the statistics of the different columns. This can be useful for figuring out what kind of skipping is possible in Parquet files, or even to obtain a quick overview of what the different columns contain. The function supports glob patterns to query metadata across multiple files in parallel:
SELECT *FROM parquet_metadata('test.parquet'); file_name | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types--------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- test.parquet | 0 | 2 | 3 | (varies) | 0 | (varies) | 2 | id | INT32 | 1 | 2 | 0 | NULL | 1 | 2 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL test.parquet | 0 | 2 | 3 | (varies) | 1 | (varies) | 2 | name | BYTE_ARRAY | alpha | beta | 0 | NULL | alpha | beta | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL test.parquet | 0 | 2 | 3 | (varies) | 2 | (varies) | 2 | value | INT32 | 10 | 20 | 0 | NULL | 10 | 20 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULLSELECT *FROM parquet_metadata('data/*.parquet'); file_name | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types---------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- data/file_1.parquet | 0 | 2 | 3 | (varies) | 0 | (varies) | 2 | id | INT32 | 1 | 2 | 0 | NULL | 1 | 2 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL data/file_1.parquet | 0 | 2 | 3 | (varies) | 1 | (varies) | 2 | name | BYTE_ARRAY | alpha | beta | 0 | NULL | alpha | beta | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL data/file_1.parquet | 0 | 2 | 3 | (varies) | 2 | (varies) | 2 | value | INT32 | 10 | 20 | 0 | NULL | 10 | 20 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL data/file_2.parquet | 0 | 2 | 3 | (varies) | 0 | (varies) | 2 | id | INT32 | 1 | 2 | 0 | NULL | 1 | 2 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL data/file_2.parquet | 0 | 2 | 3 | (varies) | 1 | (varies) | 2 | name | BYTE_ARRAY | alpha | beta | 0 | NULL | alpha | beta | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL data/file_2.parquet | 0 | 2 | 3 | (varies) | 2 | (varies) | 2 | value | INT32 | 10 | 20 | 0 | NULL | 10 | 20 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULLThe columns returned by parquet_metadata are:
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_metadata('test.parquet')); column_name | column_type----------------------------+---------------------------------------------------------------------------------------------------------------- file_name | VARCHAR row_group_id | BIGINT row_group_num_rows | BIGINT row_group_num_columns | BIGINT row_group_bytes | BIGINT column_id | BIGINT file_offset | BIGINT num_values | BIGINT path_in_schema | VARCHAR type | VARCHAR stats_min | VARCHAR stats_max | VARCHAR stats_null_count | BIGINT stats_distinct_count | BIGINT stats_min_value | VARCHAR stats_max_value | VARCHAR compression | VARCHAR encodings | VARCHAR index_page_offset | BIGINT dictionary_page_offset | BIGINT data_page_offset | BIGINT total_compressed_size | BIGINT total_uncompressed_size | BIGINT key_value_metadata | MAP(BLOB, BLOB) bloom_filter_offset | BIGINT bloom_filter_length | BIGINT min_is_exact | BOOLEAN max_is_exact | BOOLEAN row_group_compressed_bytes | BIGINT geo_bbox | STRUCT(xmin DOUBLE, xmax DOUBLE, ymin DOUBLE, ymax DOUBLE, zmin DOUBLE, zmax DOUBLE, mmin DOUBLE, mmax DOUBLE) geo_types | VARCHAR[]Parquet Schema
The parquet_schema function can be used to query the internal schema contained within a Parquet file. Note that this is the schema as it is contained within the metadata of the Parquet file. If you want to figure out the column names and types contained within a Parquet file it is easier to use DESCRIBE.
Fetch the column names and column types:
DESCRIBE SELECT * FROM 'test.parquet'; column_name | column_type | null | key | default | extra-------------+-------------+------+------+---------+------- id | INTEGER | YES | NULL | NULL | NULL name | VARCHAR | YES | NULL | NULL | NULL value | INTEGER | YES | NULL | NULL | NULLFetch the internal schema of a Parquet file:
SELECT *FROM parquet_schema('test.parquet'); file_name | name | type | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type | duckdb_type | column_id--------------+---------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------+-------------+----------- test.parquet | duckdb_schema | NULL | NULL | REQUIRED | 3 | NULL | NULL | NULL | NULL | NULL | NULL | 0 test.parquet | id | INT32 | NULL | OPTIONAL | NULL | INT_32 | NULL | NULL | NULL | NULL | INTEGER | 1 test.parquet | name | BYTE_ARRAY | NULL | OPTIONAL | NULL | UTF8 | NULL | NULL | NULL | NULL | VARCHAR | 2 test.parquet | value | INT32 | NULL | OPTIONAL | NULL | INT_32 | NULL | NULL | NULL | NULL | INTEGER | 3The columns returned by parquet_schema are:
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_schema('test.parquet')); column_name | column_type-----------------+------------- file_name | VARCHAR name | VARCHAR type | VARCHAR type_length | VARCHAR repetition_type | VARCHAR num_children | BIGINT converted_type | VARCHAR scale | BIGINT precision | BIGINT field_id | BIGINT logical_type | VARCHAR duckdb_type | VARCHAR column_id | BIGINTParquet File Metadata
The parquet_file_metadata function can be used to query file-level metadata such as the format version and the encryption algorithm used:
SELECT *FROM parquet_file_metadata('test.parquet'); file_name | created_by | num_rows | num_row_groups | format_version | encryption_algorithm | footer_signing_key_metadata | file_size_bytes | footer_size | column_orders--------------+------------+----------+----------------+----------------+----------------------+-----------------------------+-----------------+-------------+------------------------------------------------------------------------------------------------------------------------------------ test.parquet | (varies) | 2 | 1 | 1 | NULL | NULL | (varies) | (varies) | {ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder())}The columns returned by parquet_file_metadata are:
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_file_metadata('test.parquet')); column_name | column_type-----------------------------+------------- file_name | VARCHAR created_by | VARCHAR num_rows | BIGINT num_row_groups | BIGINT format_version | BIGINT encryption_algorithm | VARCHAR footer_signing_key_metadata | VARCHAR file_size_bytes | UBIGINT footer_size | UBIGINT column_orders | VARCHAR[]Parquet Key-Value Metadata
The parquet_kv_metadata function can be used to query custom metadata defined as key-value pairs. Its key and value columns are returned as BLOB, so they display in hexadecimal:
SELECT *FROM parquet_kv_metadata('test.parquet'); file_name | key | value--------------+-------------------+------------------------------------------------------------------------------------- test.parquet | \\x69735f6576656e | \\x6e6f7420276f646427 test.parquet | \\x6e756d626572 | \\x416e7377657220746f206c6966652c20756e6976657273652c20616e642065766572797468696e67The columns returned by parquet_kv_metadata are:
SELECT column_name, column_typeFROM (DESCRIBE SELECT * FROM parquet_kv_metadata('test.parquet')); column_name | column_type-------------+------------- file_name | VARCHAR key | BLOB value | BLOBFull Metadata
The parquet_full_metadata function returns all metadata for a Parquet file in a single row, combining the results of parquet_file_metadata, parquet_metadata, parquet_schema and parquet_kv_metadata as nested struct arrays. The length of each array reflects the file's structure — here three column chunks and four schema elements:
SELECT len(parquet_metadata) AS column_chunks, len(parquet_schema) AS schema_elements, len(parquet_kv_metadata) AS kv_pairsFROM parquet_full_metadata('test.parquet'); column_chunks | schema_elements | kv_pairs---------------+-----------------+---------- 3 | 4 | 0Each struct array contains the same columns as the corresponding standalone function; run SELECT * FROM parquet_full_metadata('test.parquet') to retrieve them all.
Bloom Filters
SereneDB supports Bloom filters for pruning the row groups that need to be read to answer highly selective queries. Currently, Bloom filters are supported for the following types:
- Integer types:
TINYINT,UTINYINT,SMALLINT,USMALLINT,INTEGER,UINTEGER,BIGINT,UBIGINT - Floating point types:
FLOAT,DOUBLE VARCHARBLOB
The parquet_bloom_probe(filename, column_name, value) function shows which row groups can be excluded when filtering for a given value of a given column using the Bloom filter.
For example:
FROM parquet_bloom_probe('my_file.parquet', 'name', 'gamma'); file_name | row_group_id | bloom_filter_excludes-----------------+--------------+----------------------- my_file.parquet | 0 | f