3  Reading and writing data with Polars

An overview of Polars I/O capabilities:

Format                                                            Implementation          Read  Write  Scan
Delimited files (CSV, TSV)                                        native                  x     x      x
Spreadsheets (Excel, ODS)                                         via external libraries  x     x
Parquet/Avro/IPC                                                  native                  x     x      x
JSON                                                              native                  x     x      x (NDJSON only)
Database                                                          via external libraries  x     x
Cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage)  native                  x     x      x

A note on implementation: a Polars function or method written entirely in Rust is referred to here as ‘native’. This definition is not strictly accurate, but it is a convenient way to distinguish built-in functionality from functions/methods invoked through external packages.

A note about scan: similar to the concept of lazy reading in the readr package, Polars allows you to scan an input file. Scanning defers the actual parsing of the file and instead returns a LazyFrame, a holder for a lazy computation. This deferral offers notable performance benefits, because Polars can optimize the whole query (for example, pushing filters and column selections down into the reader) before any data is read.

3.1 Reading data

3.1.1 Delimited file (CSV, TSV)

For a complete list of parameter options to use with the Polars CSV readers, see this page. These parameters provide the same functionality as readr’s arguments, but with slightly different names. Here’s an example:

# Read a CSV file
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # flag indicating whether the first row contains header, default=True
    infer_schema_length=1000,       # maximum number of lines to read for schema inference, default=100
    n_rows=10                       # maximum number of lines to read
)

# Display the first 3 rows
flights_202212.head(3)
shape: (3, 110)
Year  Quarter  Month  DayofMonth  DayOfWeek  …  Div5TotalGTime  Div5LongestGTime  Div5WheelsOff  Div5TailNum
i64   i64      i64    i64         i64        …  str             str               str            str
2022  4        12     19          1          …  null            null              ""             ""
2022  4        12     20          2          …  null            null              ""             ""
2022  4        12     21          3          …  null            null              ""             ""


In a complex data set, it is common to override the inferred data types for specific columns:

# Read the CSV file with specified data types for selected columns
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # whether the first row contains a header, default=True
    infer_schema_length=1000,       # maximum number of lines to read to infer schema, default=100
    n_rows=10,                      # maximum number of lines to read
    try_parse_dates=True, 
    schema_overrides={'Year':pl.Int32, 'Quarter':pl.Int32, 'Month': pl.Int32, 'Reporting_Airline': pl.Categorical} 
)

flights_202212.head(3)
shape: (3, 110)
Year  Quarter  Month  DayofMonth  DayOfWeek  …  Div5TotalGTime  Div5LongestGTime  Div5WheelsOff  Div5TailNum
i32   i32      i32    i64         i64        …  str             str               str            str
2022  4        12     19          1          …  null            null              ""             ""
2022  4        12     20          2          …  null            null              ""             ""
2022  4        12     21          3          …  null            null              ""             ""

3.1.2 Reading multiple files

Polars’ scan_*() functions offer an efficient way to read multiple files.

from pathlib import Path

def convert_bytes(size):
    """ Convert bytes to KB, or MB or GB"""
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f %s" % (size, x)
        size /= 1024.0

for p in Path.cwd().rglob('data/On_Time*.csv'):
    print(p.name, ":", convert_bytes(p.stat().st_size))
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_5.csv : 249.1 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv : 229.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_8.csv : 253.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_6.csv : 248.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_7.csv : 256.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_3.csv : 243.3 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_2.csv : 212.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv : 239.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_4.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_9.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_10.csv : 246.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_11.csv : 235.7 MB

To read multiple files into a single DataFrame, we can use globbing patterns:

no_flights_each_month = (
    pl.scan_csv('./data/On_Time*.csv')
    .group_by(['Year', 'Month'])
    .agg(
        pl.len().alias('Rows Count')
    )
    .sort('Month')
    .collect()
)

no_flights_each_month
shape: (12, 3)
Year  Month  Rows Count
i64   i64    u32
2022  1      537902
2022  2      495713
2022  3      564853
2022  4      556502
2022  5      578819
…     …      …
2022  8      589810
2022  9      557494
2022  10     572287
2022  11     546410
2022  12     557095

If your files don’t need to end up in a single table, you can also build a query plan for each file and execute the plans in parallel on the Polars thread pool.

All query plan execution is embarrassingly parallel and doesn’t require any communication.

queries = []
for p in Path.cwd().rglob('data/On_Time*.csv'):
    q = (
        pl.scan_csv(p)
        .group_by(['Year', 'Month'])
        .agg(
            pl.len().alias('Rows Count')
        )
    )
    queries.append(q)

dfs = pl.concat(
    pl.collect_all(queries) # collect_all returns a list of result DataFrames
)
dfs
shape: (12, 3)
Year  Month  Rows Count
i64   i64    u32
2022  5      578819
2022  1      537902
2022  8      589810
2022  6      577283
2022  7      594957
…     …      …
2022  12     557095
2022  4      556502
2022  9      557494
2022  10     572287
2022  11     546410

3.1.3 Reading spreadsheets

3.1.4 Reading a SQL table

3.1.5 Reading S3

3.2 Writing data