Skip to main content

Directly Reading Files

SereneDB allows directly reading files via the read_text and read_blob functions. These functions accept a filename, a list of filenames, or a glob pattern. They output the content of each file as a VARCHAR or BLOB, respectively, along with metadata such as the file size and last modified time.

read_text

The read_text table function reads from the selected source(s) to a VARCHAR. Each file results in a single row with the content field holding the entire content of the respective file.

Query
SELECT size, parse_path(filename), contentFROM read_text('test/sql/table_function/files/*.txt');
Result
 size | parse_path(filename)                      | content------+-------------------------------------------+--------------   12 | {test,sql,table_function,files,one.txt}   | Hello World!    2 | {test,sql,table_function,files,three.txt} | 42   10 | {test,sql,table_function,files,two.txt}   | Föö Bär

SereneDB first validates the file content as valid UTF-8. If read_text attempts to read a file with invalid UTF-8, SereneDB throws an error suggesting to use read_blob instead.

read_text also supports reading from pipes (e.g., /dev/stdin).

read_blob

The read_blob table function reads from the selected source(s) to a BLOB:

Query
SELECT size, content, filenameFROM read_blob('test/sql/table_function/files/*');
Result
 size | content                       | filename------+-------------------------------+-----------------------------------------   13 | \\x626c6f6220666978747572650a | test/sql/table_function/files/four.blob   13 | \\x48656c6c6f20576f726c64210a | test/sql/table_function/files/one.txt    3 | \\x34320a                     | test/sql/table_function/files/three.txt   11 | \\x46c3b6c3b62042c3a4720a     | test/sql/table_function/files/two.txt

The maximum allowed file size for read_blob is 3.9 GiB.

Schema

The schemas of the tables returned by read_text and read_blob are identical:

Query
DESCRIBE FROM read_text('README.md');
Result
 column_name   | column_type              | null | key  | default | extra---------------+--------------------------+------+------+---------+------- filename      | VARCHAR                  | YES  | NULL | NULL    | NULL content       | VARCHAR                  | YES  | NULL | NULL    | NULL size          | BIGINT                   | YES  | NULL | NULL    | NULL last_modified | TIMESTAMP WITH TIME ZONE | YES  | NULL | NULL    | NULL

Hive Partitioning

Data can be read from Hive partitioned datasets.

Query
SELECT *FROM read_blob('data/parquet-testing/hive-partitioning/simple/**/*.parquet')WHERE part IN ('a', 'b') AND date >= '2012-01-01';
Result
 filename                                                                            | content  | size     | last_modified       | date       | part-------------------------------------------------------------------------------------+----------+----------+---------------------+------------+------ data/parquet-testing/hive-partitioning/simple/part=a/date=2012-01-01/data_0.parquet | (varies) | (varies) | 2024-11-12 02:23:20 | 2012-01-01 | a data/parquet-testing/hive-partitioning/simple/part=b/date=2013-01-01/data_0.parquet | (varies) | (varies) | 2024-11-12 02:23:20 | 2013-01-01 | b

Handling Missing Metadata

When the underlying filesystem cannot provide this data (e.g., HTTPFS may not always return a valid timestamp), the cell is set to NULL instead.

Support for Projection Pushdown

These table functions also use projection pushdown to avoid computing properties unnecessarily. For example, you can glob a directory of large files to get file sizes in the size column. As long as you omit the content column, SereneDB won't read the file data.