3  Reading and writing data with Polars

An overview of Polars I/O capabilities:

Format                                                            Implementation          Read  Write  Scan
Delimited files (CSV, TSV)                                        native                  x     x      x
Spreadsheets (Excel, ODS)                                         via external libraries  x     x
Parquet/Avro/IPC                                                  native                  x     x      x
JSON                                                              native                  x     x      x (NDJSON only)
Database                                                          via external libraries  x     x
Cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage)  native                  x     x      x

A note on implementation: a Polars function or method written entirely in Rust is referred to here as ‘native’. This definition is not strictly accurate, but it is a convenient way to distinguish built-in functionality from functions/methods invoked through external packages.

A note about scan: similar to the concept of lazy reading in the readr package, Polars allows you to scan an input file. Scanning defers the actual parsing of the file and instead returns a LazyFrame, a holder for a lazy computation. This deferral offers notable performance benefits, because Polars can optimize the whole query (for example, pushing filters and column selections down into the reader) before any data is read.

3.1 Reading data

3.1.1 Delimited file (CSV, TSV)

For a complete list of parameter options to use with the Polars CSV readers, see this page. These parameters provide the same functionality as readr’s arguments, but with slightly different names. Here’s an example:

# Read a CSV file
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # flag indicating whether the first row contains header, default=True
    infer_schema_length=1000,       # maximum number of lines to read for schema inference, default=100
    n_rows=10                       # maximum number of lines to read
)

# Display the first 3 rows
flights_202212.head(3)
shape: (3, 110)
Year  Quarter  Month  DayofMonth  DayOfWeek  …  Div5TotalGTime  Div5LongestGTime  Div5WheelsOff  Div5TailNum
i64   i64      i64    i64         i64        …  str             str               str            str
2022  4        12     19          1          …  null            null              ""             ""
2022  4        12     20          2          …  null            null              ""             ""
2022  4        12     21          3          …  null            null              ""             ""


In a complex data set, it is common to override the inferred data types for specific columns:

# Read the CSV file with specified data types for selected columns
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # whether the first row contains a header, default=True
    infer_schema_length=1000,       # maximum number of lines to read to infer schema, default=100
    n_rows=10,                      # maximum number of lines to read
    try_parse_dates=True, 
    schema_overrides={'Year':pl.Int32, 'Quarter':pl.Int32, 'Month': pl.Int32, 'Reporting_Airline': pl.Categorical} 
)

flights_202212.head(3)
shape: (3, 110)
Year  Quarter  Month  DayofMonth  DayOfWeek  …  Div5TotalGTime  Div5LongestGTime  Div5WheelsOff  Div5TailNum
i32   i32      i32    i64         i64        …  str             str               str            str
2022  4        12     19          1          …  null            null              ""             ""
2022  4        12     20          2          …  null            null              ""             ""
2022  4        12     21          3          …  null            null              ""             ""

3.1.2 Reading multiple files

Polars’ scan_*() functions offer an efficient way to read multiple files.

from pathlib import Path

def convert_bytes(size):
    """ Convert bytes to KB, or MB or GB"""
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f %s" % (size, x)
        size /= 1024.0

for p in Path.cwd().rglob('data/On_Time*.csv'):
    print(p.name, ":", convert_bytes(p.stat().st_size))
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_5.csv : 249.1 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv : 229.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_8.csv : 253.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_6.csv : 248.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_7.csv : 256.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_3.csv : 243.3 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_2.csv : 212.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv : 239.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_4.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_9.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_10.csv : 246.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_11.csv : 235.7 MB

To read multiple files into a single DataFrame, we can use globbing patterns:

no_flights_each_month = (
    pl.scan_csv('./data/On_Time*.csv')
    .group_by(['Year', 'Month'])
    .agg(
        pl.len().alias('Rows Count')
    )
    .sort('Month')
    .collect()
)

no_flights_each_month
shape: (12, 3)
Year  Month  Rows Count
i64   i64    u32
2022  1      537902
2022  2      495713
2022  3      564853
2022  4      556502
2022  5      578819
…     …      …
2022  8      589810
2022  9      557494
2022  10     572287
2022  11     546410
2022  12     557095

If your files don’t need to end up in a single table, you can also build a query plan for each file and execute the plans in parallel on the Polars thread pool.

All query plan execution is embarrassingly parallel and doesn’t require any communication.

queries = []
for p in Path.cwd().rglob('data/On_Time*.csv'):
    q = (
        pl.scan_csv(p)
        .group_by(['Year', 'Month'])
        .agg(
            pl.len().alias('Rows Count')
        )
    )
    queries.append(q)

dfs = pl.concat(
    pl.collect_all(queries) # collect_all returns a list of result DataFrames
)
dfs
shape: (12, 3)
Year  Month  Rows Count
i64   i64    u32
2022  5      578819
2022  1      537902
2022  8      589810
2022  6      577283
2022  7      594957
…     …      …
2022  12     557095
2022  4      556502
2022  9      557494
2022  10     572287
2022  11     546410

3.1.3 Reading spreadsheets

3.1.4 Reading a SQL table

3.1.5 Reading S3

3.2 Writing data