Overview
Examples
Read a single Parquet file:
SELECT * FROM 'test.parquet'; id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20Figure out which columns/types are in a Parquet file:
DESCRIBE SELECT * FROM 'test.parquet'; column_name | column_type | null | key | default | extra-------------+-------------+------+------+---------+------- id | INTEGER | YES | NULL | NULL | NULL name | VARCHAR | YES | NULL | NULL | NULL value | INTEGER | YES | NULL | NULL | NULLCreate a table from a Parquet file:
CREATE TABLE test AS SELECT * FROM 'test.parquet';If the file does not end in .parquet, use the read_parquet function:
SELECT *FROM read_parquet('test.parq'); id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20Use list parameter to read three Parquet files and treat them as a single table:
SELECT *FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']); id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20Read all files that match the glob pattern:
SELECT *FROM 'test/*.parquet'; id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20Read all files that match the glob pattern, and include the filename virtual column that specifies which file each row came from:
SELECT *, filenameFROM read_parquet('test/*.parquet'); id | name | value | filename----+-------+-------+------------------------ 1 | alpha | 10 | test/parquet_1.parquet 2 | beta | 20 | test/parquet_1.parquet 1 | alpha | 10 | test/parquet_2.parquet 2 | beta | 20 | test/parquet_2.parquetUse a list of globs to read all Parquet files from two specific folders:
SELECT *FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']); id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20 1 | alpha | 10 2 | beta | 20Read over HTTPS:
SELECT *FROM read_parquet('https://some.url/some_file.parquet'); id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20Query the metadata of a Parquet file:
SELECT *FROM parquet_metadata('test.parquet'); file_name | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types--------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- test.parquet | 0 | 2 | 3 | (varies) | 0 | (varies) | 2 | id | INT32 | 1 | 2 | 0 | NULL | 1 | 2 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL test.parquet | 0 | 2 | 3 | (varies) | 1 | (varies) | 2 | name | BYTE_ARRAY | alpha | beta | 0 | NULL | alpha | beta | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULL test.parquet | 0 | 2 | 3 | (varies) | 2 | (varies) | 2 | value | INT32 | 10 | 20 | 0 | NULL | 10 | 20 | SNAPPY | PLAIN | NULL | NULL | (varies) | (varies) | (varies) | {} | NULL | NULL | t | t | (varies) | NULL | NULLQuery the file metadata of a Parquet file:
SELECT *FROM parquet_file_metadata('test.parquet'); file_name | created_by | num_rows | num_row_groups | format_version | encryption_algorithm | footer_signing_key_metadata | file_size_bytes | footer_size | column_orders--------------+------------+----------+----------------+----------------+----------------------+-----------------------------+-----------------+-------------+------------------------------------------------------------------------------------------------------------------------------------ test.parquet | (varies) | 2 | 1 | 1 | NULL | NULL | (varies) | (varies) | {ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder())}Query the key-value metadata of a Parquet file:
SELECT *FROM parquet_kv_metadata('test.parquet'); file_name | key | value--------------+-------------------+------------------------------------------------------------------------------------- test.parquet | \\x69735f6576656e | \\x6e6f7420276f646427 test.parquet | \\x6e756d626572 | \\x416e7377657220746f206c6966652c20756e6976657273652c20616e642065766572797468696e67Query the schema of a Parquet file:
SELECT *FROM parquet_schema('test.parquet'); file_name | name | type | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type | duckdb_type | column_id--------------+---------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------+-------------+----------- test.parquet | duckdb_schema | NULL | NULL | REQUIRED | 3 | NULL | NULL | NULL | NULL | NULL | NULL | 0 test.parquet | id | INT32 | NULL | OPTIONAL | NULL | INT_32 | NULL | NULL | NULL | NULL | INTEGER | 1 test.parquet | name | BYTE_ARRAY | NULL | OPTIONAL | NULL | UTF8 | NULL | NULL | NULL | NULL | VARCHAR | 2 test.parquet | value | INT32 | NULL | OPTIONAL | NULL | INT_32 | NULL | NULL | NULL | NULL | INTEGER | 3Write the results of a query to a Parquet file using the default compression (Snappy):
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT parquet);Write the results from a query to a Parquet file with specific compression and row group size:
COPY (FROM generate_series(100_000)) TO 'test.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);Export the table contents of the entire database as parquet:
EXPORT DATABASE 'target_directory' (FORMAT parquet);Parquet Files
Parquet files are compressed columnar files that are efficient to load and process. SereneDB provides support for both reading and writing Parquet files in an efficient manner, as well as support for pushing filters and projections into the Parquet file scans.
read_parquet Function
| Function | Description | Example |
|---|---|---|
read_parquet(path_or_list_of_paths) | Read Parquet file(s) | SELECT * FROM read_parquet('test.parquet'); |
parquet_scan(path_or_list_of_paths) | Alias for read_parquet | SELECT * FROM parquet_scan('test.parquet'); |
If your file ends in .parquet, the function syntax is optional. The system will automatically infer that you are reading a Parquet file:
SELECT * FROM 'test.parquet'; id | name | value----+-------+------- 1 | alpha | 10 2 | beta | 20Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
Parameters
There are a number of options exposed that can be passed to the read_parquet function or the COPY statement.
| Name | Description | Type | Default |
|---|---|---|---|
binary_as_string | Parquet files generated by legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. Set this to true to load binary columns as strings. | BOOL | false |
encryption_config | Configuration for Parquet encryption. | STRUCT | - |
filename | Whether or not an extra filename column should be included in the result. The filename column is added automatically as a virtual column and this option is only kept for compatibility reasons. | BOOL | false |
file_row_number | Whether or not to include the file_row_number column. | BOOL | false |
hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | (auto-detected) |
union_by_name | Whether the columns of multiple schemas should be unified by name, rather than by position. | BOOL | false |
schema | Allows you to read a Parquet file as if it has the supplied schema. Field IDs are required. | MAP | NULL |
Using the schema Parameter
The schema parameter allows you to read the Parquet file using a specific schema. This is useful for renaming, adding, deleting, reordering, or casting columns when reading Parquet files.
To use the schema parameter, field IDs are required. To make them available when creating the Parquet using SereneDB, use:
COPY (SELECT 42::INTEGER AS i) TO 'integers.parquet' (FIELD_IDS {i: 0});Reading Parquet files:
SELECT *FROM read_parquet('integers.parquet', schema = MAP { 0: {name: 'renamed_i', type: 'BIGINT', default_value: NULL}, 1: {name: 'new_column', type: 'UTINYINT', default_value: 43} }); renamed_i | new_column-----------+------------ NULL | 43 NULL | 43Partial Reading
SereneDB supports projection pushdown into the Parquet file itself. That is to say, when querying a Parquet file, only the columns required for the query are read. This allows you to read only the part of the Parquet file that you are interested in. This will be done automatically by SereneDB.
SereneDB also supports filter pushdown into the Parquet reader. When you apply a filter to a column that is scanned from a Parquet file, the filter will be pushed down into the scan, and can even be used to skip parts of the file using the built-in zonemaps (also known as min-max indexes). Note that this will depend on whether or not your Parquet file contains zonemaps.
Filter and projection pushdown provide significant performance benefits.
Inserts and Views
You can also insert the data into a table or create a table from the Parquet file directly. This will load the data from the Parquet file and insert it into the database:
Insert the data from the Parquet file in the table:
INSERT INTO people SELECT * FROM read_parquet('test.parquet');Create a table directly from a Parquet file:
CREATE TABLE people AS SELECT * FROM read_parquet('test.parquet');If you wish to keep the data stored inside the Parquet file, but want to query the Parquet file directly, you can create a view over the read_parquet function. You can then query the Parquet file as if it were a built-in table:
Create a view over the Parquet file:
CREATE VIEW people AS SELECT * FROM read_parquet('test.parquet');Query the Parquet file:
SELECT * FROM people; id | name----+------ 1 | Ada 2 | BobWriting to Parquet Files
SereneDB also has support for writing to Parquet files using the COPY statement syntax. See the COPY Statement page for details, including all possible parameters for the COPY statement.
Write a query to a Snappy-compressed Parquet file:
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT parquet);Write tbl to a zstd-compressed Parquet file:
COPY tbl TO 'result-zstd.parquet' (FORMAT parquet, COMPRESSION zstd);Write tbl to a zstd-compressed Parquet file with the lowest compression level yielding the fastest compression:
COPY tbl TO 'result-zstd.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 1);Write to Parquet file with key-value metadata:
COPY ( SELECT 42 AS number, true AS is_even) TO 'kv_metadata.parquet' ( FORMAT parquet, KV_METADATA { number: 'Answer to life, universe, and everything', is_even: 'not ''odd''' -- single quotes in values must be escaped });Write to a Parquet v2 file:
COPY tbl TO 'result-v2.parquet' (FORMAT parquet, PARQUET_VERSION 'V2');Write a CSV file to an uncompressed Parquet file:
COPY 'test.csv' TO 'result-uncompressed.parquet' (FORMAT parquet, COMPRESSION uncompressed);Write a query to a Parquet file with zstd-compression and row group size:
COPY (FROM generate_series(100_000)) TO 'row-groups-zstd.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);Write data to an LZ4-compressed Parquet file:
COPY (FROM generate_series(100_000)) TO 'result-lz4.parquet' (FORMAT parquet, COMPRESSION lz4);Or, equivalently:
COPY (FROM generate_series(100_000)) TO 'result-lz4.parquet' (FORMAT parquet, COMPRESSION lz4_raw);Write data to a Brotli-compressed Parquet file:
COPY (FROM generate_series(100_000)) TO 'result-brotli.parquet' (FORMAT parquet, COMPRESSION brotli);To configure the page size of Parquet file's dictionary pages, use the STRING_DICTIONARY_PAGE_SIZE_LIMIT option (default: 1 MB):
COPY lineitem TO 'lineitem-with-custom-dictionary-size.parquet' (FORMAT parquet, STRING_DICTIONARY_PAGE_SIZE_LIMIT 100_000);SereneDB's EXPORT command can be used to export an entire database to a series of Parquet files. See the “EXPORT statement” page for more details:
Export the table contents of the entire database as Parquet:
EXPORT DATABASE 'target_directory' (FORMAT parquet);Encryption
SereneDB supports reading and writing encrypted Parquet files.
Supported Features
The list of supported Parquet features is available in the Parquet documentation's “Implementation status” page.