Edit this page

Overview

Examples

Read a single Parquet file:

Query

SELECT * FROM 'test.parquet';

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20

Figure out which columns/types are in a Parquet file:

Query

DESCRIBE SELECT * FROM 'test.parquet';

Result

 column_name | column_type | null | key  | default | extra-------------+-------------+------+------+---------+------- id          | INTEGER     | YES  | NULL | NULL    | NULL name        | VARCHAR     | YES  | NULL | NULL    | NULL value       | INTEGER     | YES  | NULL | NULL    | NULL

Create a table from a Parquet file:

Query

CREATE TABLE test AS    SELECT * FROM 'test.parquet';

If the file does not end in .parquet, use the read_parquet function:

Query

SELECT *FROM read_parquet('test.parq');

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20

Use list parameter to read three Parquet files and treat them as a single table:

Query

SELECT *FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20

Read all files that match the glob pattern:

Query

SELECT *FROM 'test/*.parquet';

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20

Read all files that match the glob pattern, and include the filename virtual column that specifies which file each row came from:

Query

SELECT *, filenameFROM read_parquet('test/*.parquet');

Result

 id | name  | value | filename----+-------+-------+------------------------  1 | alpha |    10 | test/parquet_1.parquet  2 | beta  |    20 | test/parquet_1.parquet  1 | alpha |    10 | test/parquet_2.parquet  2 | beta  |    20 | test/parquet_2.parquet

Use a list of globs to read all Parquet files from two specific folders:

Query

SELECT *FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20  1 | alpha |    10  2 | beta  |    20

Read over HTTPS:

Query

SELECT *FROM read_parquet('https://some.url/some_file.parquet');

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20

Query the metadata of a Parquet file:

Query

SELECT *FROM parquet_metadata('test.parquet');

Result

 file_name    | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type       | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size | key_value_metadata | bloom_filter_offset | bloom_filter_length | min_is_exact | max_is_exact | row_group_compressed_bytes | geo_bbox | geo_types--------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+-----------+-------------------+------------------------+------------------+-----------------------+-------------------------+--------------------+---------------------+---------------------+--------------+--------------+----------------------------+----------+----------- test.parquet |            0 |                  2 |                     3 | (varies)        |         0 | (varies)    |          2 | id             | INT32      | 1         | 2         |                0 | NULL                 | 1               | 2               | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL test.parquet |            0 |                  2 |                     3 | (varies)        |         1 | (varies)    |          2 | name           | BYTE_ARRAY | alpha     | beta      |                0 | NULL                 | alpha           | beta            | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL test.parquet |            0 |                  2 |                     3 | (varies)        |         2 | (varies)    |          2 | value          | INT32      | 10        | 20        |                0 | NULL                 | 10              | 20              | SNAPPY      | PLAIN     | NULL              | NULL                   | (varies)         | (varies)              | (varies)                | {}                 | NULL                | NULL                | t            | t            | (varies)                   | NULL     | NULL

Query the file metadata of a Parquet file:

Query

SELECT *FROM parquet_file_metadata('test.parquet');

Result

 file_name    | created_by | num_rows | num_row_groups | format_version | encryption_algorithm | footer_signing_key_metadata | file_size_bytes | footer_size | column_orders--------------+------------+----------+----------------+----------------+----------------------+-----------------------------+-----------------+-------------+------------------------------------------------------------------------------------------------------------------------------------ test.parquet | (varies)   |        2 |              1 |              1 | NULL                 | NULL                        | (varies)        | (varies)    | {ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder()),ColumnOrder(TYPE_ORDER=TypeDefinedOrder())}

Query the key-value metadata of a Parquet file:

Query

SELECT *FROM parquet_kv_metadata('test.parquet');

Result

 file_name    | key               | value--------------+-------------------+------------------------------------------------------------------------------------- test.parquet | \\x69735f6576656e | \\x6e6f7420276f646427 test.parquet | \\x6e756d626572   | \\x416e7377657220746f206c6966652c20756e6976657273652c20616e642065766572797468696e67

Query the schema of a Parquet file:

Query

SELECT *FROM parquet_schema('test.parquet');

Result

 file_name    | name          | type       | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type | duckdb_type | column_id--------------+---------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------+-------------+----------- test.parquet | duckdb_schema | NULL       | NULL        | REQUIRED        |            3 | NULL           | NULL  | NULL      | NULL     | NULL         | NULL        |         0 test.parquet | id            | INT32      | NULL        | OPTIONAL        |         NULL | INT_32         | NULL  | NULL      | NULL     | NULL         | INTEGER     |         1 test.parquet | name          | BYTE_ARRAY | NULL        | OPTIONAL        |         NULL | UTF8           | NULL  | NULL      | NULL     | NULL         | VARCHAR     |         2 test.parquet | value         | INT32      | NULL        | OPTIONAL        |         NULL | INT_32         | NULL  | NULL      | NULL     | NULL         | INTEGER     |         3

Write the results of a query to a Parquet file using the default compression (Snappy):

Query

COPY    (SELECT * FROM tbl)    TO 'result-snappy.parquet'    (FORMAT parquet);

Write the results from a query to a Parquet file with specific compression and row group size:

Query

COPY    (FROM generate_series(100_000))    TO 'test.parquet'    (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);

Export the table contents of the entire database as parquet:

Query

EXPORT DATABASE 'target_directory' (FORMAT parquet);

Parquet Files

Parquet files are compressed columnar files that are efficient to load and process. SereneDB provides support for both reading and writing Parquet files in an efficient manner, as well as support for pushing filters and projections into the Parquet file scans.

`read_parquet` Function

Function	Description	Example
`read_parquet(path_or_list_of_paths)`	Read Parquet file(s)	`SELECT * FROM read_parquet('test.parquet');`
`parquet_scan(path_or_list_of_paths)`	Alias for `read_parquet`	`SELECT * FROM parquet_scan('test.parquet');`

If your file ends in .parquet, the function syntax is optional. The system will automatically infer that you are reading a Parquet file:

Query

SELECT * FROM 'test.parquet';

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20

Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.

Parameters

There are a number of options exposed that can be passed to the read_parquet function or the COPY statement.

Name	Description	Type	Default
`binary_as_string`	Parquet files generated by legacy writers do not correctly set the `UTF8` flag for strings, causing string columns to be loaded as `BLOB` instead. Set this to true to load binary columns as strings.	`BOOL`	`false`
`encryption_config`	Configuration for Parquet encryption.	`STRUCT`	-
`filename`	Whether or not an extra `filename` column should be included in the result. The `filename` column is added automatically as a virtual column and this option is only kept for compatibility reasons.	`BOOL`	`false`
`file_row_number`	Whether or not to include the `file_row_number` column.	`BOOL`	`false`
`hive_partitioning`	Whether or not to interpret the path as a Hive partitioned path.	`BOOL`	(auto-detected)
`union_by_name`	Whether the columns of multiple schemas should be unified by name, rather than by position.	`BOOL`	`false`
`schema`	Allows you to read a Parquet file as if it has the supplied schema. Field IDs are required.	`MAP`	`NULL`

Using the `schema` Parameter

The schema parameter allows you to read the Parquet file using a specific schema. This is useful for renaming, adding, deleting, reordering, or casting columns when reading Parquet files.

To use the schema parameter, field IDs are required. To make them available when creating the Parquet using SereneDB, use:

Query

COPY (SELECT 42::INTEGER AS i) TO 'integers.parquet' (FIELD_IDS {i: 0});

Reading Parquet files:

Query

SELECT *FROM read_parquet('integers.parquet', schema = MAP {                    0: {name: 'renamed_i', type: 'BIGINT', default_value: NULL},                    1: {name: 'new_column', type: 'UTINYINT', default_value: 43}                  });

Result

 renamed_i | new_column-----------+------------ NULL      |         43 NULL      |         43

Partial Reading

SereneDB supports projection pushdown into the Parquet file itself. That is to say, when querying a Parquet file, only the columns required for the query are read. This allows you to read only the part of the Parquet file that you are interested in. This will be done automatically by SereneDB.

SereneDB also supports filter pushdown into the Parquet reader. When you apply a filter to a column that is scanned from a Parquet file, the filter will be pushed down into the scan, and can even be used to skip parts of the file using the built-in zonemaps (also known as min-max indexes). Note that this will depend on whether or not your Parquet file contains zonemaps.

Filter and projection pushdown provide significant performance benefits.

Inserts and Views

You can also insert the data into a table or create a table from the Parquet file directly. This will load the data from the Parquet file and insert it into the database:

Insert the data from the Parquet file in the table:

Query

INSERT INTO people    SELECT * FROM read_parquet('test.parquet');

Create a table directly from a Parquet file:

Query

CREATE TABLE people AS    SELECT * FROM read_parquet('test.parquet');

If you wish to keep the data stored inside the Parquet file, but want to query the Parquet file directly, you can create a view over the read_parquet function. You can then query the Parquet file as if it were a built-in table:

Create a view over the Parquet file:

Query

CREATE VIEW people AS    SELECT * FROM read_parquet('test.parquet');

Query the Parquet file:

Query

SELECT * FROM people;

Result

 id | name----+------  1 | Ada  2 | Bob

Writing to Parquet Files

SereneDB also has support for writing to Parquet files using the COPY statement syntax. See the COPY Statement page for details, including all possible parameters for the COPY statement.

Write a query to a Snappy-compressed Parquet file:

Query

COPY    (SELECT * FROM tbl)    TO 'result-snappy.parquet'    (FORMAT parquet);

Write tbl to a zstd-compressed Parquet file:

Query

COPY tbl    TO 'result-zstd.parquet'    (FORMAT parquet, COMPRESSION zstd);

Write tbl to a zstd-compressed Parquet file with the lowest compression level yielding the fastest compression:

Query

COPY tbl    TO 'result-zstd.parquet'    (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 1);

Write to Parquet file with key-value metadata:

Query

COPY (    SELECT        42 AS number,        true AS is_even) TO 'kv_metadata.parquet' (    FORMAT parquet,    KV_METADATA {        number: 'Answer to life, universe, and everything',        is_even: 'not ''odd''' -- single quotes in values must be escaped    });

Write to a Parquet v2 file:

Query

COPY tbl    TO 'result-v2.parquet'    (FORMAT parquet, PARQUET_VERSION 'V2');

Write a CSV file to an uncompressed Parquet file:

Query

COPY    'test.csv'    TO 'result-uncompressed.parquet'    (FORMAT parquet, COMPRESSION uncompressed);

Write a query to a Parquet file with zstd-compression and row group size:

Query

COPY    (FROM generate_series(100_000))    TO 'row-groups-zstd.parquet'    (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);

Write data to an LZ4-compressed Parquet file:

Query

COPY    (FROM generate_series(100_000))    TO 'result-lz4.parquet'    (FORMAT parquet, COMPRESSION lz4);

Or, equivalently:

Query

COPY    (FROM generate_series(100_000))    TO 'result-lz4.parquet'    (FORMAT parquet, COMPRESSION lz4_raw);

Write data to a Brotli-compressed Parquet file:

Query

COPY    (FROM generate_series(100_000))    TO 'result-brotli.parquet'    (FORMAT parquet, COMPRESSION brotli);

To configure the page size of Parquet file's dictionary pages, use the STRING_DICTIONARY_PAGE_SIZE_LIMIT option (default: 1 MB):

Query

COPY    lineitem    TO 'lineitem-with-custom-dictionary-size.parquet'    (FORMAT parquet, STRING_DICTIONARY_PAGE_SIZE_LIMIT 100_000);

SereneDB's EXPORT command can be used to export an entire database to a series of Parquet files. See the “EXPORT statement” page for more details:

Export the table contents of the entire database as Parquet:

Query

EXPORT DATABASE 'target_directory' (FORMAT parquet);

Encryption

SereneDB supports reading and writing encrypted Parquet files.

Supported Features

The list of supported Parquet features is available in the Parquet documentation's “Implementation status” page.

Examples​

Parquet Files​

read_parquet Function​

Parameters​

Using the schema Parameter​

Partial Reading​

Inserts and Views​

Writing to Parquet Files​

Encryption​

Supported Features​

Examples

Parquet Files

`read_parquet` Function

Parameters

Using the `schema` Parameter

Partial Reading

Inserts and Views

Writing to Parquet Files

Encryption

Supported Features