1  First steps with Polars

1.1 Installation

Let’s kick off our journey into the world of data manipulation with the Polars library. First things first, we need to set up a project with uv, a fast Python package manager. Install the latest stable version with:

uv add polars

Depending on your use case, you might want to install the optional dependencies as well:

uv add 'polars[numpy,pandas,pyarrow]'

To inspect the installed Polars package, including its version and enabled features, use the show_versions() function:

import polars as pl
pl.show_versions()
--------Version info---------
Polars:              1.40.1
Index type:          UInt32
Platform:            Linux-6.8.0-111-generic-x86_64-with-glibc2.39
Python:              3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.4.4
openpyxl             <not installed>
pandas               3.0.2
polars_cloud         <not installed>
pyarrow              24.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

1.2 Initial exploration

For better presentation, this book uses Polars’ configuration options to adjust the layout of printed tables. Specifically:

  • Limit the number of columns and rows displayed in DataFrames to 10 each
  • Apply stylish formatting to the tables for a pleasant reading experience
pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_formatting(rounded_corners=True)
polars.config.Config

1.2.1 Data loading

Reading a CSV file using Polars is straightforward. Let’s take a quick look:

flights = pl.read_csv('./data/flights.csv')

For those familiar with R’s dplyr, Polars offers a similar method called glimpse():

# Glimpse the first 10 columns
flights[:,0:10].glimpse()
Rows: 999
Columns: 10
$ Year                        <i64> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022
$ Quarter                     <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ Month                       <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ DayofMonth                  <i64> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
$ DayOfWeek                   <i64> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7
$ FlightDate                  <str> '2022-01-14', '2022-01-15', '2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20', '2022-01-21', '2022-01-22', '2022-01-23'
$ Reporting_Airline           <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ DOT_ID_Reporting_Airline    <i64> 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452
$ IATA_CODE_Reporting_Airline <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ Tail_Number                 <str> 'N119HQ', 'N122HQ', 'N412YX', 'N405YX', 'N420YX', 'N446YX', 'N116HQ', 'N419YX', 'N137HQ', 'N110HQ'


Standard commands from pandas, such as head(), tail(), and describe(), work seamlessly:

flights.head()
shape: (5, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 14 5 … null null "" "" null
2022 1 1 15 6 … null null "" "" null
2022 1 1 16 7 … null null "" "" null
2022 1 1 17 1 … null null "" "" null
2022 1 1 18 2 … null null "" "" null
flights.tail()
shape: (5, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 12 3 … null null "" "" null
2022 1 1 13 4 … null null "" "" null
2022 1 1 14 5 … null null "" "" null
2022 1 1 17 1 … null null "" "" null
2022 1 1 18 2 … null null "" "" null
flights.describe()
shape: (9, 111)
statistic Year Quarter Month DayofMonth … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
str f64 f64 f64 f64 … str str str str str
"count" 999.0 999.0 999.0 999.0 … "0" "0" "999" "999" "0"
"null_count" 0.0 0.0 0.0 0.0 … "999" "999" "0" "0" "999"
"mean" 2022.0 1.0 1.0 16.2002 … null null null null null
"std" 0.0 0.0 0.0 8.802666 … null null null null null
"min" 2022.0 1.0 1.0 1.0 … null null "" "" null
"25%" 2022.0 1.0 1.0 9.0 … null null null null null
"50%" 2022.0 1.0 1.0 16.0 … null null null null null
"75%" 2022.0 1.0 1.0 24.0 … null null null null null
"max" 2022.0 1.0 1.0 31.0 … null null "" "" null


If you want to take a peek at different parts of your DataFrame, here’s a handy trick: use the sample() method, which randomly picks n rows from the DataFrame and returns them for inspection.

flights.sample(3)
shape: (3, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 1 6 … null null "" "" null
2022 1 1 21 5 … null null "" "" null
2022 1 1 7 5 … null null "" "" null


The output from Polars comes with some useful features:

  • Underneath each column name is a data type.

  • No index numbers are present.

  • String values are quoted with double quotes.

  • Missing values are represented as null, applicable to all data types.

1.2.2 Row and column counting

Determining the number of rows and columns in a Polars DataFrame is as simple as checking the shape:

flights.shape
(999, 110)

1.2.3 Converting from pandas

Transitioning from a pandas DataFrame to a Polars DataFrame is effortless with the from_pandas() function. For efficient zero-copy conversion, both pandas and Polars rely on the pyarrow library as an interchange format:

import pandas as pd
flights2 = pl.from_pandas(pd.read_csv('./data/flights.csv'))

flights2[:,0:9].glimpse()
Rows: 999
Columns: 9
$ Year                        <i64> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022
$ Quarter                     <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ Month                       <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ DayofMonth                  <i64> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
$ DayOfWeek                   <i64> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7
$ FlightDate                  <str> '2022-01-14', '2022-01-15', '2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20', '2022-01-21', '2022-01-22', '2022-01-23'
$ Reporting_Airline           <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ DOT_ID_Reporting_Airline    <i64> 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452
$ IATA_CODE_Reporting_Airline <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'

1.2.4 Understanding data structure

The fundamental data structures in Polars are Series and DataFrames:

  • A Series is a 1-dimensional data structure, akin to R’s atomic vector, where all elements must share the same data type.
# Create a named Series
s = pl.Series('a', [1, 2, 3, 2, 5])
s
shape: (5,)
a
i64
1
2
3
2
5
# Note that dtype of `s` is automatically inferred as Int64
s.dtype
Int64

Constructing a Series with a specific dtype:

s2 = pl.Series('a', [1, 2, 3], dtype=pl.Float32)
s2
shape: (3,)
a
f32
1.0
2.0
3.0
  • Series provides a wide range of methods for various operations, including standard statistical functions like .max(), .mean(), as well as specialized ones such as .entropy() and .unique_counts().
print(s.max())
print(s.mean())
5
2.6
s.unique_counts()
shape: (4,)
a
u32
1
2
1
1
  • DataFrames are 2-dimensional structures, similar to R’s data.frame, built on top of Series. In the examples below, borrowed from RealPython, we create a DataFrame using two different approaches:

Creating DataFrame from a dictionary:

import numpy as np
num_rows = 5000
rng = np.random.default_rng(seed=7)

buildings_data = {
     "sqft": rng.exponential(scale=1000, size=num_rows),
     "year": rng.integers(low=1995, high=2023, size=num_rows),
     "building_type": rng.choice(["A", "B", "C"], size=num_rows),
 }
buildings = pl.DataFrame(buildings_data)
buildings
shape: (5_000, 3)
sqft year building_type
f64 i64 str
707.529256 1996 "C"
1025.203348 2020 "C"
568.548657 2012 "A"
895.109864 2000 "A"
206.532754 2011 "A"
… … …
710.435755 2003 "C"
408.872783 2009 "C"
57.562059 2019 "C"
3728.088949 2020 "C"
686.678345 2011 "C"

Creating DataFrame from multiple Series:

s1 = pl.Series('sqft', rng.exponential(scale=1000, size=num_rows))
s2 = pl.Series('year', rng.integers(low=1995, high=2023, size=num_rows))
s3 = pl.Series('building_type', rng.choice(["A", "B", "C"], size=num_rows))

buildings2 = pl.DataFrame([s1, s2, s3])
buildings2
shape: (5_000, 3)
sqft year building_type
f64 i64 str
220.644811 1998 "C"
966.183262 2006 "C"
295.737178 2010 "A"
233.546019 2009 "C"
2392.394417 2022 "B"
… … …
373.652761 2011 "A"
384.053786 2010 "B"
1388.573406 1999 "A"
1225.981395 2007 "B"
1206.351218 2022 "B"


DataFrames come with several attributes for exploration:

# Get the number of rows
flights.height
999
# Get the number of columns
flights.width
110
# Get a list of column names
flights.columns[:10]
['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'FlightDate',
 'Reporting_Airline',
 'DOT_ID_Reporting_Airline',
 'IATA_CODE_Reporting_Airline',
 'Tail_Number']
# Get a list of column dtypes
flights.dtypes[:10]
[Int64, Int64, Int64, Int64, Int64, String, String, Int64, String, String]
# Get a Schema object mapping column names to their dtypes
flights[:,:10].schema
Schema([('Year', Int64),
        ('Quarter', Int64),
        ('Month', Int64),
        ('DayofMonth', Int64),
        ('DayOfWeek', Int64),
        ('FlightDate', String),
        ('Reporting_Airline', String),
        ('DOT_ID_Reporting_Airline', Int64),
        ('IATA_CODE_Reporting_Airline', String),
        ('Tail_Number', String)])

1.3 Summary

At first glance, Polars offers an ease of use reminiscent of R, blended with the familiarity of pandas (minus some of its potential frustrations).