5.1 Why Pandas feels clunky after using R, but Polars doesn’t
The initial demonstration of Polars in action isn’t a real-world scenario but rather a simplified example, deliberately chosen to highlight Polars’ fluent API, which, in my view, resonates with many tidyverse users.
Rasmus Baath, in his insightful blog post, compared the usability of R’s dplyr and Python’s Pandas using a simple data analysis exercise. He argued that Pandas feels clunky and complicated compared to the core package of the tidyverse. I found his example to be compelling because I have had the same personal experience using Pandas for various data tasks at work. I recommend reading his post first to gain context for the subsequent code snippets.
By borrowing Ramus’s case, I aim to showcase Polar’s fluent interface which is arguably as silky smooth as tidyverse’s. While this example carries biased opinion, it serves as an useful illustration of what a good API design should look like. Pandas remains an excellent package, but it’s important to acknowledge its quirks, especially for newcomers.
5.1.1 Reading data
To begin, let’s take a look at the dataset, which consists of a table of purchases detailing different amount and discount applied in various country.
# A tibble: 6 × 3
country amount discount
<chr> <dbl> <dbl>
1 USA 2000 10
2 USA 3500 15
3 USA 3000 20
4 Canada 120 12
5 Canada 180 18
6 Canada 3100 21
# Pandasimport pandas as pdpurchases_pd = pd.read_csv("./data/purchases.csv")purchases_pd.head()
country amount discount
0 USA 2000 10
1 USA 3500 15
2 USA 3000 20
3 Canada 120 12
4 Canada 180 18
# Polarsimport polars as plpurchases_pl = pl.read_csv("./data/purchases.csv")pl.Config.set_tbl_formatting('NOTHING')
<class 'polars.config.Config'>
pl.Config.set_tbl_hide_column_data_types(True)
<class 'polars.config.Config'>
purchases_pl.head()
shape: (5, 3)
country
amount
discount
"USA"
2000
10
"USA"
3500
15
"USA"
3000
20
"Canada"
120
12
"Canada"
180
18
5.1.2 “How much do we sell..? Let’s take the total sum!”
There’s nothing noteworthy here. In R, we use the sum() function, whereas in Pandas and Polars, we use the Series.sum() method for similar operation.
# Rpurchases$amount |>sum()
[1] 17210
# Pandaspurchases_pd["amount"].sum()
17210
# Polarspurchases_pl['amount'].sum()
17210
5.1.3 “Ah, they wanted it by country…”
Some subtle yet significant differences emerge:
In R and Polars, the output consistenly remains as a DataFrame. However, in Pandas, this isn’t guaranteed, so the use of reset_index().
R’s and Poars’ solutions are typically present themselves straightforwardly, while in Pandas, this isn’t always the case (why resetting index? why passing sum method as a string?)
Polar’s version may appear more verbose, but this is a design choice. And it’s hard to beat dplyr’s clarity.
# A tibble: 11 × 2
country total
<chr> <dbl>
1 Australia 600
2 Brazil 460
3 Canada 3400
4 France 500
5 Germany 570
6 India 720
7 Italy 630
8 Japan 690
9 Spain 660
10 UK 480
11 USA 8500
country total
0 Australia 600
1 Brazil 460
2 Canada 3400
3 France 500
4 Germany 570
5 India 720
6 Italy 630
7 Japan 690
8 Spain 660
9 UK 480
10 USA 8500
In this task, Polars closely resembles the intuitive syntax of dplyr. However, achieving similar operation in Pandas requires a large change in code structure.
# A tibble: 11 × 2
country total
<chr> <dbl>
1 Australia 540
2 Brazil 414
3 Canada 3349
4 France 450
5 Germany 513
6 India 648
7 Italy 567
8 Japan 621
9 Spain 594
10 UK 432
11 USA 8455
country total
0 Australia 540
1 Brazil 414
2 Canada 3349
3 France 450
4 Germany 513
5 India 648
6 Italy 567
7 Japan 621
8 Spain 594
9 UK 432
10 USA 8455
5.1.5 “Oh, and Maria asked me to remove any outliers.”
Now Pandas deviates from its usual syntax with the introduction of query()method. Meanwhile, Polars maintains its fluency, demonstrating its consistency very well.
# A tibble: 11 × 2
country total
<chr> <dbl>
1 Australia 540
2 Brazil 414
3 Canada 270
4 France 450
5 Germany 513
6 India 648
7 Italy 567
8 Japan 621
9 Spain 594
10 UK 432
11 USA 1990
country total
0 Australia 540
1 Brazil 414
2 Canada 270
3 France 450
4 Germany 513
5 India 648
6 Italy 567
7 Japan 621
8 Spain 594
9 UK 432
10 USA 1990
5.1.6 “I probably should use the median within each country”
In this last task, Polars shines in readability thanks to its window function API, which makes the flow of data processing explicit. On the other hand, dplyr’s solution might be a bit trickier to grasp as it involves working with grouped data frame.
# A tibble: 11 × 2
country total
<chr> <dbl>
1 Australia 540
2 Brazil 414
3 Canada 270
4 France 450
5 Germany 513
6 India 648
7 Italy 567
8 Japan 621
9 Spain 594
10 UK 432
11 USA 8455
country total
0 Australia 540
1 Brazil 414
2 Canada 270
3 France 450
4 Germany 513
5 India 648
6 Italy 567
7 Japan 621
8 Spain 594
9 UK 432
10 USA 8455