Edit this page

Tips

Below is a collection of tips to help when attempting to import complex CSV files. In the examples, we use the flights.csv file.

Override the Header Flag if the Header Is Not Correctly Detected

If a file contains only string columns the header auto-detection might fail. Provide the header option to override this behavior.

Query

SELECT * FROM read_csv('flights.csv', header = true);

Result

 flightdate | uniquecarrier | origincityname    | destcityname------------+---------------+-------------------+----------------- 2024-01-01 | AA            | New York, NY      | Los Angeles, CA 2024-01-02 | UA            | San Francisco, CA | Seattle, WA

Provide Names if the File Does Not Contain a Header

If the file does not contain a header, names will be auto-generated by default. You can provide your own names with the names option.

Query

SELECT * FROM read_csv('flights.csv', names = ['DateOfFlight', 'CarrierName']);

Result

 dateofflight | carriername | column2           | column3--------------+-------------+-------------------+----------------- 2024-01-01   | AA          | New York, NY      | Los Angeles, CA 2024-01-02   | UA          | San Francisco, CA | Seattle, WA

Override the Types of Specific Columns

The types flag can be used to override types of only certain columns by providing a struct of name → type mappings.

Query

SELECT * FROM read_csv('flights.csv', types = {'FlightDate': 'DATE'});

Result

 flightdate | uniquecarrier | origincityname    | destcityname------------+---------------+-------------------+----------------- 2024-01-01 | AA            | New York, NY      | Los Angeles, CA 2024-01-02 | UA            | San Francisco, CA | Seattle, WA

Use `COPY` When Loading Data into a Table

The COPY statement copies data directly into a table. The CSV reader uses the schema of the table instead of auto-detecting types from the file. This speeds up the auto-detection, and prevents mistakes from being made during auto-detection.

Query

COPY tbl FROM 'test.csv';

Use `union_by_name` When Loading Files with Different Schemas

The union_by_name option can be used to unify the schema of files that have different or missing columns. For files that do not have certain columns, NULL values are filled in.

Query

SELECT * FROM read_csv('flights*.csv', union_by_name = true);

Result

 flightdate | uniquecarrier | origincityname    | destcityname------------+---------------+-------------------+----------------- 2024-01-01 | AA            | New York, NY      | Los Angeles, CA 2024-01-02 | UA            | San Francisco, CA | Seattle, WA 2024-01-01 | AA            | New York, NY      | Los Angeles, CA 2024-01-02 | UA            | San Francisco, CA | Seattle, WA

To load data into an existing table where the table has more columns than the CSV file, you can use the INSERT INTO ... BY NAME clause:

Query

INSERT INTO tbl BY NAME    SELECT * FROM read_csv('input.csv');

Sample Size

If the CSV sniffer is not detecting the correct type, try increasing the sample size. The option sample_size = -1 forces the sniffer to read the entire file:

Query

SELECT * FROM read_csv('my_csv_file.csv', sample_size = -1);

Result

 id | name  | value----+-------+-------  1 | alpha |    10  2 | beta  |    20

Override the Header Flag if the Header Is Not Correctly Detected​

Provide Names if the File Does Not Contain a Header​

Override the Types of Specific Columns​

Use COPY When Loading Data into a Table​

Use union_by_name When Loading Files with Different Schemas​

Sample Size​

Override the Header Flag if the Header Is Not Correctly Detected

Provide Names if the File Does Not Contain a Header

Override the Types of Specific Columns

Use `COPY` When Loading Data into a Table

Use `union_by_name` When Loading Files with Different Schemas

Sample Size