5  Polars in action

5.1 Why Pandas feels clunky after using R, but Polars doesn’t

The initial demonstration of Polars in action isn’t a real-world scenario but rather a simplified example, deliberately chosen to highlight Polars’ fluent API, which, in my view, resonates with many tidyverse users.

Rasmus Baath, in his insightful blog post, compared the usability of R’s dplyr and Python’s Pandas using a simple data analysis exercise. He argued that Pandas feels clunky and complicated compared to the core package of the tidyverse. I found his example to be compelling because I have had the same personal experience using Pandas for various data tasks at work. I recommend reading his post first to gain context for the subsequent code snippets.

By borrowing Ramus’s case, I aim to showcase Polar’s fluent interface which is arguably as silky smooth as tidyverse’s. While this example carries biased opinion, it serves as an useful illustration of what a good API design should look like. Pandas remains an excellent package, but it’s important to acknowledge its quirks, especially for newcomers.

5.1.1 Reading data

To begin, let’s take a look at the dataset, which consists of a table of purchases detailing different amount and discount applied in various country.

# R
require(readr)
purchases = readr::read_csv("./data/purchases.csv")
purchases |> head()
# A tibble: 6 × 3
  country amount discount
  <chr>    <dbl>    <dbl>
1 USA       2000       10
2 USA       3500       15
3 USA       3000       20
4 Canada     120       12
5 Canada     180       18
6 Canada    3100       21
# Pandas
import pandas as pd
purchases_pd = pd.read_csv("./data/purchases.csv")
purchases_pd.head()
  country  amount  discount
0     USA    2000        10
1     USA    3500        15
2     USA    3000        20
3  Canada     120        12
4  Canada     180        18
# Polars
import polars as pl
purchases_pl = pl.read_csv("./data/purchases.csv")
pl.Config.set_tbl_formatting('NOTHING')
<class 'polars.config.Config'>
pl.Config.set_tbl_hide_column_data_types(True)
<class 'polars.config.Config'>
purchases_pl.head()
shape: (5, 3)
country amount discount
"USA" 2000 10
"USA" 3500 15
"USA" 3000 20
"Canada" 120 12
"Canada" 180 18

5.1.2 “How much do we sell..? Let’s take the total sum!”

There’s nothing noteworthy here. In R, we use the sum() function, whereas in Pandas and Polars, we use the Series.sum() method for similar operation.

# R
purchases$amount |> sum()
[1] 17210
# Pandas
purchases_pd["amount"].sum()
17210
# Polars
purchases_pl['amount'].sum()
17210

5.1.3 “Ah, they wanted it by country…”

Some subtle yet significant differences emerge:

  • In R and Polars, the output consistenly remains as a DataFrame. However, in Pandas, this isn’t guaranteed, so the use of reset_index().

  • R’s and Poars’ solutions are typically present themselves straightforwardly, while in Pandas, this isn’t always the case (why resetting index? why passing sum method as a string?)

  • Polar’s version may appear more verbose, but this is a design choice. And it’s hard to beat dplyr’s clarity.

# R
purchases |>
  dplyr::group_by(country) |>
  dplyr::summarize(total = sum(amount))
# A tibble: 11 × 2
   country   total
   <chr>     <dbl>
 1 Australia   600
 2 Brazil      460
 3 Canada     3400
 4 France      500
 5 Germany     570
 6 India       720
 7 Italy       630
 8 Japan       690
 9 Spain       660
10 UK          480
11 USA        8500
# Pandas
(purchases_pd
  .groupby("country")
  .agg(total=("amount", "sum")) 
  .reset_index()                
)
      country  total
0   Australia    600
1      Brazil    460
2      Canada   3400
3      France    500
4     Germany    570
5       India    720
6       Italy    630
7       Japan    690
8       Spain    660
9          UK    480
10        USA   8500
# Polars
(purchases_pl
  .group_by('country')
  .agg(total=pl.col('amount').sum())
  .sort('country')
)
shape: (11, 2)
country total
"Australia" 600
"Brazil" 460
"Canada" 3400
"France" 500
"Germany" 570
"Italy" 630
"Japan" 690
"Spain" 660
"UK" 480
"USA" 8500

5.1.4 “And I guess I should deduct the discount.”

In this task, Polars closely resembles the intuitive syntax of dplyr. However, achieving similar operation in Pandas requires a large change in code structure.

# R
purchases |> 
  dplyr::group_by(country) |> 
  dplyr::summarize(total = sum(amount - discount))
# A tibble: 11 × 2
   country   total
   <chr>     <dbl>
 1 Australia   540
 2 Brazil      414
 3 Canada     3349
 4 France      450
 5 Germany     513
 6 India       648
 7 Italy       567
 8 Japan       621
 9 Spain       594
10 UK          432
11 USA        8455
# Pandas
(purchases_pd
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum())
  .reset_index()
  .rename(columns={0: "total"})                           
)
      country  total
0   Australia    540
1      Brazil    414
2      Canada   3349
3      France    450
4     Germany    513
5       India    648
6       Italy    567
7       Japan    621
8       Spain    594
9          UK    432
10        USA   8455
# Polars
(purchases_pl
  .group_by('country')
  .agg(
    total=(pl.col('amount') - pl.col('discount')).sum()
  )
  .sort('country')
)
shape: (11, 2)
country total
"Australia" 540
"Brazil" 414
"Canada" 3349
"France" 450
"Germany" 513
"Italy" 567
"Japan" 621
"Spain" 594
"UK" 432
"USA" 8455

5.1.5 “Oh, and Maria asked me to remove any outliers.”

Now Pandas deviates from its usual syntax with the introduction of query() method. Meanwhile, Polars maintains its fluency, demonstrating its consistency very well.

# R
purchases |>
  dplyr::filter(amount <= median(amount) * 10) |>
  dplyr::group_by(country) |> 
  dplyr::summarize(total = sum(amount - discount))
# A tibble: 11 × 2
   country   total
   <chr>     <dbl>
 1 Australia   540
 2 Brazil      414
 3 Canada      270
 4 France      450
 5 Germany     513
 6 India       648
 7 Italy       567
 8 Japan       621
 9 Spain       594
10 UK          432
11 USA        1990
# Pandas
(purchases_pd
  .query("amount <= amount.median() * 10")
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum())
  .reset_index()
  .rename(columns={0: "total"})
)
      country  total
0   Australia    540
1      Brazil    414
2      Canada    270
3      France    450
4     Germany    513
5       India    648
6       Italy    567
7       Japan    621
8       Spain    594
9          UK    432
10        USA   1990
# Polars
(purchases_pl
  .filter(
    pl.col('amount') <= (pl.col('amount').median() * 10)
  )
  .group_by('country')
  .agg(
    total=(pl.col('amount') - pl.col('discount')).sum()
  )
  .sort('country')
)
shape: (11, 2)
country total
"Australia" 540
"Brazil" 414
"Canada" 270
"France" 450
"Germany" 513
"Italy" 567
"Japan" 621
"Spain" 594
"UK" 432
"USA" 1990

5.1.6 “I probably should use the median within each country”

In this last task, Polars shines in readability thanks to its window function API, which makes the flow of data processing explicit. On the other hand, dplyr’s solution might be a bit trickier to grasp as it involves working with grouped data frame.

# R 
purchases |>
  dplyr::group_by(country) |>                     
  dplyr::filter(amount <= median(amount) * 10) |> 
  dplyr::summarize(total = sum(amount - discount))
# A tibble: 11 × 2
   country   total
   <chr>     <dbl>
 1 Australia   540
 2 Brazil      414
 3 Canada      270
 4 France      450
 5 Germany     513
 6 India       648
 7 Italy       567
 8 Japan       621
 9 Spain       594
10 UK          432
11 USA        8455
# Pandas
(purchases_pd
  .groupby("country")                                               
  .apply(lambda df: df[df["amount"] <= df["amount"].median() * 10]) 
  .reset_index(drop=True)                                           
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum())
  .reset_index()
  .rename(columns={0: "total"})
)
      country  total
0   Australia    540
1      Brazil    414
2      Canada    270
3      France    450
4     Germany    513
5       India    648
6       Italy    567
7       Japan    621
8       Spain    594
9          UK    432
10        USA   8455
# Polars
(purchases_pl
  .filter(
    pl.col('amount') <= (pl.col('amount').median().over('country') * 10)
  )
  .group_by('country')
  .agg(
    total=(pl.col('amount') - pl.col('discount')).sum()
  )
  .sort('country')
)
shape: (11, 2)
country total
"Australia" 540
"Brazil" 414
"Canada" 270
"France" 450
"Germany" 513
"Italy" 567
"Japan" 621
"Spain" 594
"UK" 432
"USA" 8455