Cloud storage (AWS S3, Azure blob storage, Google cloud storage)
native
x
x
x
About implementation: ‘Native’ means the I/O is handled by Polars itself — no external library is required. ‘Via external libraries’ means Polars delegates to a third-party package that must be installed separately.
About scan: Similar to the concept of lazy reading in the readr package, Polars allows you to scan an input file, which defers the actual parsing of the file and provides a LazyFrame, a lazy computation holder. This feature offers notable performance benefits:
reduction of memory usage by reading only the necessary data
various optimizations by the query planner
3.1 Reading data
3.1.1 Delimited file (CSV, TSV)
For a complete list of parameter options to use with Polars CSV readers, see this page. These parameters provide similar functionality to readr’s arguments but with slightly different names. Here’s an example:
# Read a CSV fileflights_202212 = pl.read_csv( source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv', separator=',', # single character used to separate fields, default=',' has_header=True, # whether the first row contains column names, default=True infer_schema_length=1000, # maximum number of lines to read for schema inference, default=100 n_rows=10# maximum number of lines to read)# Display the first 3 rowsflights_202212.head(3)
shape: (3, 110)
Year
Quarter
Month
DayofMonth
DayOfWeek
…
Div5TotalGTime
Div5LongestGTime
Div5WheelsOff
Div5TailNum
i64
i64
i64
i64
i64
…
str
str
str
str
str
2022
4
12
19
1
…
null
null
""
""
null
2022
4
12
20
2
…
null
null
""
""
null
2022
4
12
21
3
…
null
null
""
""
null
For complex datasets, it is often useful to override inferred dtypes for specific columns.
# Read the CSV file with specified data types for selected columnsflights_202212 = pl.read_csv( source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv', separator=',', # single character used to separate fields, default=',' has_header=True, # whether the first row contains column names, default=True infer_schema_length=1000, # maximum number of lines to read to infer schema, default=100 n_rows=10, # maximum number of lines to read try_parse_dates=True, schema_overrides={'Year':pl.Int32, 'Quarter':pl.Int32, 'Month': pl.Int32, 'Reporting_Airline': pl.Categorical} )flights_202212.head(3)
shape: (3, 110)
Year
Quarter
Month
DayofMonth
DayOfWeek
…
Div5TotalGTime
Div5LongestGTime
Div5WheelsOff
Div5TailNum
i32
i32
i32
i64
i64
…
str
str
str
str
str
2022
4
12
19
1
…
null
null
""
""
null
2022
4
12
20
2
…
null
null
""
""
null
2022
4
12
21
3
…
null
null
""
""
null
3.1.2 Reading multiple files
Polars’ scan_*() method is an efficient technique for reading multiple files.
from pathlib import Pathdef convert_bytes(size):"""Convert bytes to a human-readable unit."""for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:if size <1024.0:return"%3.1f%s"% (size, x) size /=1024.0for p in Path('./data').rglob('On_Time*.csv'):print(p.name, ":", convert_bytes(p.stat().st_size))
If each file can be processed independently, you can build one lazy query per file and execute all queries in parallel with Polars’ thread pool.
This pattern is embarrassingly parallel because the query plans do not need to communicate with each other.
queries = []for p in Path('./data').rglob('On_Time*.csv'): q = ( pl.scan_csv(p) .group_by(['Year', 'Month']) .agg( pl.len().alias('Rows Count') ) ) queries.append(q)dfs = pl.concat( pl.collect_all(queries) # this returns a list of result DataFrames)dfs
shape: (12, 3)
Year
Month
Rows Count
i64
i64
u32
2022
5
578819
2022
1
537902
2022
8
589810
2022
6
577283
2022
7
594957
…
…
…
2022
12
557095
2022
4
556502
2022
9
557494
2022
10
572287
2022
11
546410
3.1.3 Reading spreadsheets
Polars can read Excel (.xlsx, .xls) and ODS files with read_excel(). Unlike CSV, Excel requires an external parsing engine — Polars supports three:
engine
package
notes
'calamine' (default)
fastexcel
fastest; Rust-based; recommended for most files
'xlsx2csv'
xlsx2csv
converts to CSV in memory then reads
'openpyxl'
openpyxl
slowest but handles unusual or malformed files
# R: readxllibrary(readxl)read_excel("./data/EWS.xlsx") # first sheetread_excel("./data/EWS.xlsx", sheet ="Sheet1") # by nameread_excel("./data/EWS.xlsx", sheet =2) # by position
# Polars: first sheet (default)pl.read_excel("./data/EWS.xlsx").shape
(29245, 50)
# specific sheet by namepl.read_excel("./data/EWS.xlsx", sheet_name="Sheet2").shape
(28136, 50)
When a workbook contains multiple sheets, pass sheet_id=0 to read all sheets at once — Polars returns a dictionary keyed by sheet name:
# read all sheets into a dict of DataFramessheets = pl.read_excel("./data/EWS.xlsx", sheet_id=0)list(sheets.keys())
['Sheet1', 'Sheet2']
sheet1_df = sheets["Sheet1"]sheet1_df.shape
(29245, 50)
The equivalent in R is iterating with purrr::map() over readxl::excel_sheets():
Polars can read query results from many SQL databases directly, returning a DataFrame without going through pandas. There are two functions:
pl.read_database_uri(query, uri) — takes a connection string; uses the connectorx engine by default
pl.read_database(query, connection) — takes an existing connection object (DBAPI2 or ADBC)
Conceptually, both are closest to R’s DBI::dbGetQuery(): Polars executes the SQL first and returns a DataFrame eagerly.
Reading from SQLite via a URI string:
# R: DBI + RSQLitelibrary(DBI)con <-dbConnect(RSQLite::SQLite(), "./data/ontime.sqlite")dbGetQuery(con, "SELECT Reporting_Airline, COUNT(*) AS n_flights FROM ontime GROUP BY Reporting_Airline ORDER BY n_flights DESC")dbDisconnect(con)
DuckDB integrates natively with Polars via its Python package — pass an open connection directly to read_database():
# R: DBI + duckdblibrary(DBI)con <-dbConnect(duckdb::duckdb(), "./data/ontime.duckdb")dbGetQuery(con, "SELECT Reporting_Airline, COUNT(*) AS n_flights FROM ontime GROUP BY Reporting_Airline ORDER BY n_flights DESC")dbDisconnect(con)
import duckdbconn = duckdb.connect("./data/ontime.duckdb")airline_counts = pl.read_database( query=""" SELECT Reporting_Airline, COUNT(*) AS n_flights FROM ontime GROUP BY Reporting_Airline ORDER BY n_flights DESC """, connection=conn)conn.close()airline_counts
shape: (17, 2)
Reporting_Airline
n_flights
str
i64
"WN"
1307149
"DL"
893172
"AA"
874145
"OO"
733576
"UA"
626473
…
…
"F9"
155482
"G4"
116877
"YV"
114779
"QX"
88791
"HA"
73865
For large result sets, read_database_uri supports partitioned parallel reads via partition_on=, partition_range=, and partition_num= — useful when fetching large result sets from a database table with a partitionable integer column.
3.2 Writing data
3.2.1 Writing CSV
write_csv() writes a DataFrame to a delimited file. The main parameters map directly to readr::write_csv():
# write to CSVfull_flights.write_csv("./output/flights_clean.csv")# write with gzip compressionfull_flights.write_csv("./output/flights_clean.csv.gz", compression="gzip")
Common options:
parameter
default
purpose
separator
','
field delimiter (use '\t' for TSV)
include_header
True
write column names as first row
null_value
''
string to write for null cells
date_format
auto
format string for Date columns
compression
'uncompressed'
'gzip' or 'zstd'
3.2.2 Writing Parquet
Parquet is the recommended format for storing DataFrames on disk: it is columnar, strongly typed, and compresses very efficiently. Writing is a single method call:
full_flights.write_parquet("./output/flights.parquet")# choose compression codec (zstd is the default and generally best)full_flights.write_parquet("./output/flights.parquet", compression="snappy")
For very large tables, partition_by= writes a directory of per-partition Parquet files (Hive-style), which cloud query engines and scan_parquet() can then read with partition pruning:
# write one file per carrier: ./output/Reporting_Airline=AA/data.parquet, etc.full_flights.write_parquet("./output/flights_partitioned/", partition_by="Reporting_Airline")# read back with predicate pushdown — only the AA partition is loadedaa_flights = pl.scan_parquet("./output/flights_partitioned/", hive_partitioning=True).filter(pl.col("Reporting_Airline") =="AA").collect()
3.2.3 Writing Excel
write_excel() requires xlsxwriter (uv pip install xlsxwriter). It produces a formatted .xlsx file and supports multiple sheets, column widths, conditional formatting, and sparklines:
write_database() inserts a DataFrame into a database table. It accepts either a URI string or an open connection object, and mirrors DBI::dbWriteTable() in R.