Cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage): native
A note regarding implementation: a Polars function or method that is written entirely in Rust is referred to here as ‘native’. This definition may not be strictly accurate, but it is useful for the purpose of distinction, particularly in contrast to invoking functions/methods from external packages.
A note about scan: similar to the concept of lazy reading in the readr package, Polars allows you to scan an input file. Scanning defers the actual parsing of the file and instead returns a LazyFrame, a holder for a lazy computation. This feature offers notable performance benefits:

- reduced memory usage, since only the necessary data is read
- various optimizations applied by the query planner
3.1 Reading data
3.1.1 Delimited file (CSV, TSV)
For a complete list of parameters accepted by Polars’ CSV reader, see this page. These parameters provide the same functionality as readr’s arguments, but with slightly different names. Here’s an example:
```python
# Read a CSV file
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',             # single character used to separate fields, default=','
    has_header=True,           # flag indicating whether the first row contains a header, default=True
    infer_schema_length=1000,  # maximum number of lines to read for schema inference, default=100
    n_rows=10                  # maximum number of lines to read
)

# Display the first 3 rows
flights_202212.head(3)
```
shape: (3, 110)

| Year | Quarter | Month | DayofMonth | DayOfWeek | … | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum |      |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | i64 | i64 | … | str | str | str | str | str |
| 2022 | 4 | 12 | 19 | 1 | … | null | null | "" | "" | null |
| 2022 | 4 | 12 | 20 | 2 | … | null | null | "" | "" | null |
| 2022 | 4 | 12 | 21 | 3 | … | null | null | "" | "" | null |
In a complex data set, it is common to override the data types of specific columns:
```python
# Read the CSV file with specified data types for selected columns
flights_202212 = pl.read_csv(
    source='./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_12.csv',
    separator=',',             # single character used to separate fields, default=','
    has_header=True,           # flag indicating whether the first row contains a header, default=True
    infer_schema_length=1000,  # maximum number of lines to read for schema inference, default=100
    n_rows=10,                 # maximum number of lines to read
    try_parse_dates=True,
    schema_overrides={
        'Year': pl.Int32,
        'Quarter': pl.Int32,
        'Month': pl.Int32,
        'Reporting_Airline': pl.Categorical,
    }
)

flights_202212.head(3)
```
shape: (3, 110)

| Year | Quarter | Month | DayofMonth | DayOfWeek | … | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum |      |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i32 | i32 | i32 | i64 | i64 | … | str | str | str | str | str |
| 2022 | 4 | 12 | 19 | 1 | … | null | null | "" | "" | null |
| 2022 | 4 | 12 | 20 | 2 | … | null | null | "" | "" | null |
| 2022 | 4 | 12 | 21 | 3 | … | null | null | "" | "" | null |
3.1.2 Reading multiple files
Polars’ scan_*() functions offer a really neat way to read multiple files efficiently.
```python
from pathlib import Path

def convert_bytes(size):
    """Convert bytes to KB, MB, GB, or TB."""
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f%s" % (size, x)
        size /= 1024.0

for p in Path.cwd().rglob('data/On_Time*.csv'):
    print(p.name, ":", convert_bytes(p.stat().st_size))
```
If your files don’t have to end up in a single table, you can also build a query plan for each file and execute them in parallel on the Polars thread pool. All query plan execution is embarrassingly parallel and doesn’t require any communication.
```python
queries = []
for p in Path.cwd().rglob('data/On_Time*.csv'):
    q = (
        pl.scan_csv(p)
        .group_by(['Year', 'Month'])
        .agg(
            pl.len().alias('Rows Count')
        )
    )
    queries.append(q)

dfs = pl.concat(
    pl.collect_all(queries)  # collect_all returns a list of result DataFrames
)
dfs
```