1  First steps with Polars

1.1 Installation

Let’s kick off our journey into the world of data manipulation with the Polars library. First things first, we need to set up a project with uv, a fast Python package manager. Install the latest stable version with:

uv add polars

Depending on your use case, you might want to install the optional dependencies as well:

uv add 'polars[numpy,pandas,pyarrow]'

To inspect the installed Polars package, including its version and enabled features, use the show_versions() function:

import polars as pl
pl.show_versions()
--------Version info---------
Polars:              1.40.1
Index type:          UInt32
Platform:            Linux-6.8.0-111-generic-x86_64-with-glibc2.39
Python:              3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.4.4
openpyxl             <not installed>
pandas               3.0.2
polars_cloud         <not installed>
pyarrow              24.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

1.2 Initial exploration

For better presentation, this book uses Polars’ configuration options to adjust the layout of printed tables. Specifically:

  • Limit the number of columns and rows displayed in DataFrames to 10 each
  • Apply stylish formatting to the tables for a pleasant reading experience
pl.Config.set_tbl_cols(10)
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_formatting(rounded_corners=True)
polars.config.Config

1.2.1 Data loading

Reading a CSV file using Polars is straightforward. Let’s take a quick look:

flights = pl.read_csv('./data/flights.csv')

For those familiar with R’s dplyr, Polars offers a similar method called glimpse():

# Glimpse the first 10 columns
flights[:,0:10].glimpse()
Rows: 999
Columns: 10
$ Year                        <i64> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022
$ Quarter                     <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ Month                       <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ DayofMonth                  <i64> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
$ DayOfWeek                   <i64> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7
$ FlightDate                  <str> '2022-01-14', '2022-01-15', '2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20', '2022-01-21', '2022-01-22', '2022-01-23'
$ Reporting_Airline           <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ DOT_ID_Reporting_Airline    <i64> 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452
$ IATA_CODE_Reporting_Airline <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ Tail_Number                 <str> 'N119HQ', 'N122HQ', 'N412YX', 'N405YX', 'N420YX', 'N446YX', 'N116HQ', 'N419YX', 'N137HQ', 'N110HQ'


Standard commands from pandas, such as head(), tail(), and describe(), work seamlessly:

flights.head()
shape: (5, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 14 5 … null null "" "" null
2022 1 1 15 6 … null null "" "" null
2022 1 1 16 7 … null null "" "" null
2022 1 1 17 1 … null null "" "" null
2022 1 1 18 2 … null null "" "" null
flights.tail()
shape: (5, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 12 3 … null null "" "" null
2022 1 1 13 4 … null null "" "" null
2022 1 1 14 5 … null null "" "" null
2022 1 1 17 1 … null null "" "" null
2022 1 1 18 2 … null null "" "" null
flights.describe()
shape: (9, 111)
statistic Year Quarter Month DayofMonth … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
str f64 f64 f64 f64 … str str str str str
"count" 999.0 999.0 999.0 999.0 … "0" "0" "999" "999" "0"
"null_count" 0.0 0.0 0.0 0.0 … "999" "999" "0" "0" "999"
"mean" 2022.0 1.0 1.0 16.2002 … null null null null null
"std" 0.0 0.0 0.0 8.802666 … null null null null null
"min" 2022.0 1.0 1.0 1.0 … null null "" "" null
"25%" 2022.0 1.0 1.0 9.0 … null null null null null
"50%" 2022.0 1.0 1.0 16.0 … null null null null null
"75%" 2022.0 1.0 1.0 24.0 … null null null null null
"max" 2022.0 1.0 1.0 31.0 … null null "" "" null


If you want to take a peek at different parts of your DataFrame, here’s a handy trick: use the sample() method, which randomly picks n rows from the DataFrame and returns them for inspection.

flights.sample(3)
shape: (3, 110)
Year Quarter Month DayofMonth DayOfWeek … Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum
i64 i64 i64 i64 i64 … str str str str str
2022 1 1 1 6 … null null "" "" null
2022 1 1 21 5 … null null "" "" null
2022 1 1 7 5 … null null "" "" null


The output from Polars comes with some useful features:

  • Underneath each column name is a data type.

  • No index numbers are present.

  • String values are quoted with double quotes.

  • Missing values are represented as null, applicable to all data types.

1.2.2 Row and column counting

Determining the number of rows and columns in a Polars DataFrame is as simple as checking the shape:

flights.shape
(999, 110)

1.2.3 Converting from pandas

Transitioning from a pandas DataFrame to a Polars DataFrame is effortless with the from_pandas() function. For efficient zero-copy conversion, both pandas and Polars rely on the pyarrow library as an interchange format:

import pandas as pd
flights2 = pl.from_pandas(pd.read_csv('./data/flights.csv'))

flights2[:,0:9].glimpse()
Rows: 999
Columns: 9
$ Year                        <i64> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022
$ Quarter                     <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ Month                       <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ DayofMonth                  <i64> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
$ DayOfWeek                   <i64> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7
$ FlightDate                  <str> '2022-01-14', '2022-01-15', '2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20', '2022-01-21', '2022-01-22', '2022-01-23'
$ Reporting_Airline           <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'
$ DOT_ID_Reporting_Airline    <i64> 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452, 20452
$ IATA_CODE_Reporting_Airline <str> 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX', 'YX'

1.2.4 Understanding data structure

The fundamental data structures in Polars are Series and DataFrames:

  • A Series is a 1-dimensional data structure, akin to R’s atomic vector, where all elements must share the same data type.
# Create a named Series
s = pl.Series('a', [1, 2, 3, 2, 5])
s
shape: (5,)
a
i64
1
2
3
2
5
# Note that dtype of `s` is automatically inferred as Int64
s.dtype
Int64

Constructing a Series with a specific dtype:

s2 = pl.Series('a', [1, 2, 3], dtype=pl.Float32)
s2
shape: (3,)
a
f32
1.0
2.0
3.0
  • Series provides a wide range of methods for various operations, including standard statistical functions like .max(), .mean(), as well as specialized ones such as .entropy() and .unique_counts().
print(s.max())
print(s.mean())
5
2.6
s.unique_counts()
shape: (4,)
a
u32
1
2
1
1
  • DataFrames are 2-dimensional structures, similar to R’s data.frame, built on top of Series. In the examples below, borrowed from RealPython, we create a DataFrame using two different approaches:

Creating DataFrame from a dictionary:

import numpy as np
num_rows = 5000
rng = np.random.default_rng(seed=7)

buildings_data = {
     "sqft": rng.exponential(scale=1000, size=num_rows),
     "year": rng.integers(low=1995, high=2023, size=num_rows),
     "building_type": rng.choice(["A", "B", "C"], size=num_rows),
 }
buildings = pl.DataFrame(buildings_data)
buildings
shape: (5_000, 3)
sqft year building_type
f64 i64 str
707.529256 1996 "C"
1025.203348 2020 "C"
568.548657 2012 "A"
895.109864 2000 "A"
206.532754 2011 "A"
… … …
710.435755 2003 "C"
408.872783 2009 "C"
57.562059 2019 "C"
3728.088949 2020 "C"
686.678345 2011 "C"

Creating DataFrame from multiple Series:

s1 = pl.Series('sqft', rng.exponential(scale=1000, size=num_rows))
s2 = pl.Series('year', rng.integers(low=1995, high=2023, size=num_rows))
s3 = pl.Series('building_type', rng.choice(["A", "B", "C"], size=num_rows))

buildings2 = pl.DataFrame([s1, s2, s3])
buildings2
shape: (5_000, 3)
sqft year building_type
f64 i64 str
220.644811 1998 "C"
966.183262 2006 "C"
295.737178 2010 "A"
233.546019 2009 "C"
2392.394417 2022 "B"
… … …
373.652761 2011 "A"
384.053786 2010 "B"
1388.573406 1999 "A"
1225.981395 2007 "B"
1206.351218 2022 "B"


DataFrames come with several attributes for exploration:

# Get the number of rows
flights.height
999
# Get the number of columns
flights.width
110
# Get a list of column names
flights.columns[:10]
['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'FlightDate',
 'Reporting_Airline',
 'DOT_ID_Reporting_Airline',
 'IATA_CODE_Reporting_Airline',
 'Tail_Number']
# Get a list of column dtypes
flights.dtypes[:10]
[Int64, Int64, Int64, Int64, Int64, String, String, Int64, String, String]
# Get a Schema object mapping column names to their dtypes
flights[:,:10].schema
Schema([('Year', Int64),
        ('Quarter', Int64),
        ('Month', Int64),
        ('DayofMonth', Int64),
        ('DayOfWeek', Int64),
        ('FlightDate', String),
        ('Reporting_Airline', String),
        ('DOT_ID_Reporting_Airline', Int64),
        ('IATA_CODE_Reporting_Airline', String),
        ('Tail_Number', String)])

1.3 Summary

At first glance, Polars offers an ease of use reminiscent of R, blended with the familiarity of pandas (minus some of its potential frustrations).