3  Reading and writing data with Polars

An overview of Polars I/O capabilities:

Format Implementation Read Write Scan
Delimited files (CSV, TSV) native x x x
Spreadsheets (Excel, ODS) via external libraries x x
Parquet/Avro/IPC native x x x
JSON native x x x (NDJSON only)
Database via external libraries x x
Cloud storage (AWS S3, Azure blob storage, Google cloud storage) native x x x

About implementation: ‘Native’ means the I/O is handled by Polars itself — no external library is required. ‘Via external libraries’ means Polars delegates to a third-party package that must be installed separately.

About scan: Similar to the concept of lazy reading in the readr package, Polars allows you to scan an input file, which defers the actual parsing of the file and provides a LazyFrame, a lazy computation holder. This feature offers notable performance benefits:

3.1 Reading data

3.1.1 Delimited file (CSV, TSV)

For a complete list of parameter options to use with Polars CSV readers, see this page. These parameters provide similar functionality to readr’s arguments but with slightly different names. Here’s an example:

# Read a CSV file
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # whether the first row contains column names, default=True
    infer_schema_length=1000,       # maximum number of lines to read for schema inference, default=100
    n_rows=10                       # maximum number of lines to read
)

# Display the first 3 rows
flights_202212.head(3)
shape: (3, 110)
Year Quarter Month DayofMonth DayOfWeek Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 str str str str str
2022 4 12 19 1 null null "" "" null
2022 4 12 20 2 null null "" "" null
2022 4 12 21 3 null null "" "" null


For complex datasets, it is often useful to override inferred dtypes for specific columns.

# Read the CSV file with specified data types for selected columns
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',                  # single character used to separate fields, default=','
    has_header=True,                # whether the first row contains column names, default=True
    infer_schema_length=1000,       # maximum number of lines to read to infer schema, default=100
    n_rows=10,                      # maximum number of lines to read
    try_parse_dates=True, 
    schema_overrides={'Year':pl.Int32, 'Quarter':pl.Int32, 'Month': pl.Int32, 'Reporting_Airline': pl.Categorical} 
)

flights_202212.head(3)
shape: (3, 110)
Year Quarter Month DayofMonth DayOfWeek Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i32 i32 i32 i64 i64 str str str str str
2022 4 12 19 1 null null "" "" null
2022 4 12 20 2 null null "" "" null
2022 4 12 21 3 null null "" "" null

3.1.2 Reading multiple files

Polars’ scan_*() method is an efficient technique for reading multiple files.

from pathlib import Path

def convert_bytes(size):
    """Convert bytes to a human-readable unit."""
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f %s" % (size, x)
        size /= 1024.0

for p in Path('./data').rglob('On_Time*.csv'):
    print(p.name, ":", convert_bytes(p.stat().st_size))
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_5.csv : 249.1 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv : 229.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_8.csv : 253.9 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_6.csv : 248.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_7.csv : 256.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_3.csv : 243.3 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_2.csv : 212.4 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv : 239.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_4.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_9.csv : 239.5 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_10.csv : 246.6 MB
On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_11.csv : 235.7 MB

To read multiple files into a single DataFrame, we can use globbing patterns:

no_flights_each_month = (
    pl.scan_csv('./data/On_Time*.csv')
    .group_by(['Year', 'Month'])
    .agg(
        pl.len().alias('Rows Count')
    )
    .sort('Month')
    .collect()
)

no_flights_each_month
shape: (12, 3)
Year Month Rows Count
i64 i64 u32
2022 1 537902
2022 2 495713
2022 3 564853
2022 4 556502
2022 5 578819
2022 8 589810
2022 9 557494
2022 10 572287
2022 11 546410
2022 12 557095

If each file can be processed independently, you can build one lazy query per file and execute all queries in parallel with Polars’ thread pool.

This pattern is embarrassingly parallel because the query plans do not need to communicate with each other.

queries = []
for p in Path('./data').rglob('On_Time*.csv'):
    q = (
        pl.scan_csv(p)
        .group_by(['Year', 'Month'])
        .agg(
            pl.len().alias('Rows Count')
        )
    )
    queries.append(q)

dfs = pl.concat(
    pl.collect_all(queries) # this returns a list of result DataFrames
)
dfs
shape: (12, 3)
Year Month Rows Count
i64 i64 u32
2022 5 578819
2022 1 537902
2022 8 589810
2022 6 577283
2022 7 594957
2022 12 557095
2022 4 556502
2022 9 557494
2022 10 572287
2022 11 546410

3.1.3 Reading spreadsheets

Polars can read Excel (.xlsx, .xls) and ODS files with read_excel(). Unlike CSV, Excel requires an external parsing engine — Polars supports three:

engine package notes
'calamine' (default) fastexcel fastest; Rust-based; recommended for most files
'xlsx2csv' xlsx2csv converts to CSV in memory then reads
'openpyxl' openpyxl slowest but handles unusual or malformed files
# R: readxl
library(readxl)
read_excel("./data/EWS.xlsx")                       # first sheet
read_excel("./data/EWS.xlsx", sheet = "Sheet1")     # by name
read_excel("./data/EWS.xlsx", sheet = 2)            # by position
# Polars: first sheet (default)
pl.read_excel("./data/EWS.xlsx").shape
(29245, 50)
# specific sheet by name
pl.read_excel("./data/EWS.xlsx", sheet_name="Sheet2").shape
(28136, 50)

When a workbook contains multiple sheets, pass sheet_id=0 to read all sheets at once — Polars returns a dictionary keyed by sheet name:

# read all sheets into a dict of DataFrames
sheets = pl.read_excel("./data/EWS.xlsx", sheet_id=0)
list(sheets.keys())
['Sheet1', 'Sheet2']
sheet1_df = sheets["Sheet1"]
sheet1_df.shape
(29245, 50)

The equivalent in R is iterating with purrr::map() over readxl::excel_sheets():

# R: read all sheets
library(purrr)
path <- "./data/EWS.xlsx"
sheets <- excel_sheets(path) |>
    set_names() |>
    map(read_excel, path = path)

You can override inferred column types with schema_overrides=, matching readxl’s col_types=:

pl.read_excel(
    "./data/EWS.xlsx",
    sheet_name="Sheet1",
    schema_overrides={"VAR_001": pl.Int32, "VAR_002": pl.Int64}
).select(["VAR_001", "VAR_002"]).head(3)
shape: (3, 2)
VAR_001 VAR_002
i32 i64
202003 13879640
202003 14715576
202003 14893869

3.1.4 Reading a SQL table

Polars can read query results from many SQL databases directly, returning a DataFrame without going through pandas. There are two functions:

  • pl.read_database_uri(query, uri) — takes a connection string; uses the connectorx engine by default
  • pl.read_database(query, connection) — takes an existing connection object (DBAPI2 or ADBC)

Conceptually, both are closest to R’s DBI::dbGetQuery(): Polars executes the SQL first and returns a DataFrame eagerly.

Reading from SQLite via a URI string:

# R: DBI + RSQLite
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "./data/ontime.sqlite")
dbGetQuery(con, "SELECT Reporting_Airline, COUNT(*) AS n_flights
                 FROM ontime GROUP BY Reporting_Airline ORDER BY n_flights DESC")
dbDisconnect(con)
import os
db_path = os.path.abspath("./data/ontime.sqlite")
uri = f"sqlite:///{db_path}"

ohio_flights = pl.read_database_uri(
    query="""
        SELECT FlightDate, Reporting_Airline, Origin, Dest, DepDelay
        FROM ontime
        WHERE OriginStateName = 'Ohio'
        LIMIT 5
    """,
    uri=uri
)
ohio_flights
shape: (5, 5)
FlightDate Reporting_Airline Origin Dest DepDelay
str str str str f64
"2022-01-14" "YX" "CMH" "DCA" -3.0
"2022-01-15" "YX" "CMH" "DCA" -10.0
"2022-01-16" "YX" "CMH" "DCA" -6.0
"2022-01-17" "YX" "CMH" "DCA" -7.0
"2022-01-18" "YX" "CMH" "DCA" -6.0

Reading from DuckDB via a connection object:

DuckDB integrates natively with Polars via its Python package — pass an open connection directly to read_database():

# R: DBI + duckdb
library(DBI)
con <- dbConnect(duckdb::duckdb(), "./data/ontime.duckdb")
dbGetQuery(con, "SELECT Reporting_Airline, COUNT(*) AS n_flights
                 FROM ontime GROUP BY Reporting_Airline ORDER BY n_flights DESC")
dbDisconnect(con)
import duckdb

conn = duckdb.connect("./data/ontime.duckdb")
airline_counts = pl.read_database(
    query="""
        SELECT Reporting_Airline, COUNT(*) AS n_flights
        FROM ontime
        GROUP BY Reporting_Airline
        ORDER BY n_flights DESC
    """,
    connection=conn
)
conn.close()
airline_counts
shape: (17, 2)
Reporting_Airline n_flights
str i64
"WN" 1307149
"DL" 893172
"AA" 874145
"OO" 733576
"UA" 626473
"F9" 155482
"G4" 116877
"YV" 114779
"QX" 88791
"HA" 73865

For large result sets, read_database_uri supports partitioned parallel reads via partition_on=, partition_range=, and partition_num= — useful when fetching large result sets from a database table with a partitionable integer column.

3.2 Writing data

3.2.1 Writing CSV

write_csv() writes a DataFrame to a delimited file. The main parameters map directly to readr::write_csv():

# R: readr
write_csv(full_flights, "./output/flights_clean.csv")
write_csv(full_flights, "./output/flights_clean.csv.gz")   # auto-compressed
# write to CSV
full_flights.write_csv("./output/flights_clean.csv")

# write with gzip compression
full_flights.write_csv("./output/flights_clean.csv.gz", compression="gzip")

Common options:

parameter default purpose
separator ',' field delimiter (use '\t' for TSV)
include_header True write column names as first row
null_value '' string to write for null cells
date_format auto format string for Date columns
compression 'uncompressed' 'gzip' or 'zstd'

3.2.2 Writing Parquet

Parquet is the recommended format for storing DataFrames on disk: it is columnar, strongly typed, and compresses very efficiently. Writing is a single method call:

# R: arrow
library(arrow)
write_parquet(full_flights, "./output/flights.parquet")
full_flights.write_parquet("./output/flights.parquet")

# choose compression codec (zstd is the default and generally best)
full_flights.write_parquet("./output/flights.parquet", compression="snappy")

For very large tables, partition_by= writes a directory of per-partition Parquet files (Hive-style), which cloud query engines and scan_parquet() can then read with partition pruning:

# write one file per carrier: ./output/Reporting_Airline=AA/data.parquet, etc.
full_flights.write_parquet(
    "./output/flights_partitioned/",
    partition_by="Reporting_Airline"
)

# read back with predicate pushdown — only the AA partition is loaded
aa_flights = pl.scan_parquet(
    "./output/flights_partitioned/",
    hive_partitioning=True
).filter(pl.col("Reporting_Airline") == "AA").collect()

3.2.3 Writing Excel

write_excel() requires xlsxwriter (uv pip install xlsxwriter). It produces a formatted .xlsx file and supports multiple sheets, column widths, conditional formatting, and sparklines:

# R: writexl
library(writexl)
write_xlsx(full_flights, "./output/flights.xlsx")
# basic write
full_flights.write_excel("./output/flights.xlsx")

# custom sheet name with auto-fitted column widths
full_flights.write_excel(
    "./output/flights.xlsx",
    worksheet="January 2022",
    autofit=True
)

To write multiple DataFrames to separate sheets in one workbook, pass an xlsxwriter.Workbook object:

import xlsxwriter

with xlsxwriter.Workbook("./output/flights_multi.xlsx") as wb:
    full_flights.filter(pl.col("OriginStateName") == "Ohio").write_excel(wb, worksheet="Ohio")
    full_flights.filter(pl.col("OriginStateName") == "Texas").write_excel(wb, worksheet="Texas")

3.2.4 Writing to a database

write_database() inserts a DataFrame into a database table. It accepts either a URI string or an open connection object, and mirrors DBI::dbWriteTable() in R.

# R: DBI
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "./data/ontime.sqlite")
dbWriteTable(con, "airline_summary", airline_summary, overwrite = TRUE)
dbDisconnect(con)

Here we compute a per-airline summary from the January 2022 dataset and write it back to the SQLite database:

import os
import polars as pl

# define the SQLite connection URI locally so this cell is self-contained
uri = f"sqlite:///{os.path.abspath('./data/ontime.sqlite')}"

full_flights = pl.read_csv(
    './data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv'
)

airline_summary = (
    full_flights
    .group_by("Reporting_Airline")
    .agg(
        pl.len().alias("n_flights"),
        pl.col("DepDelay").mean().round(1).alias("avg_dep_delay"),
        pl.col("ArrDelay").mean().round(1).alias("avg_arr_delay"),
    )
    .sort("n_flights", descending=True)
)

airline_summary.write_database(
    table_name="airline_summary",
    connection=uri,
    if_table_exists="replace"   # 'fail' | 'replace' | 'append'
)
17

The if_table_exists= parameter controls behaviour when the target table already exists. Read the table back to verify the write:

pl.read_database_uri(
    query="SELECT * FROM airline_summary ORDER BY n_flights DESC",
    uri=uri
)
shape: (17, 4)
Reporting_Airline n_flights avg_dep_delay avg_arr_delay
str i64 f64 f64
"WN" 97436 10.2 1.1
"AA" 69400 7.9 -1.1
"DL" 68963 7.9 -0.2
"OO" 63146 14.8 7.9
"UA" 45741 13.1 4.8
"F9" 12039 15.6 11.0
"YV" 11404 19.0 11.2
"G4" 8714 12.2 12.2
"QX" 8105 12.2 11.0
"HA" 5868 9.6 6.9