4  Lazy API

4.1 What is the lazy API?

Polars offers two execution modes that can be used interchangeably:

mode entry point behaviour
Eager pl.read_csv(), pl.DataFrame each operation executes immediately and returns a result
Lazy pl.scan_csv(), pl.LazyFrame operations are recorded as a query plan and deferred until you call .collect()

The distinction is subtle but consequential. In eager mode, every method call immediately processes data and returns a new DataFrame. In lazy mode, calling .filter(), .select(), or .group_by() does nothing more than append a node to a query plan — no data is touched until you explicitly ask for results.

Under the hood, a LazyFrame wraps a logical plan: a tree structure with data sources as leaf nodes and transformations (filter, select, join, etc.) as interior nodes. Before execution, Polars’ query optimiser traverses this tree and applies rewrite rules — such as pushing filters closer to the data source or pruning unused columns — producing an optimised logical plan. The optimiser then converts this into a physical plan that selects specific algorithms (e.g., which join strategy to use). This architecture — parsing → logical plan → optimisation → physical plan → execution — is common to most modern query engines (databases, Spark, DataFusion) and is what enables Polars to outperform naive row-by-row evaluation.

CSV = './data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv'

# eager: data is read into memory immediately
df = pl.read_csv(CSV)
type(df)
<class 'polars.dataframe.frame.DataFrame'>
# lazy: only a query plan is created — no data loaded yet
lf = pl.scan_csv(CSV)
type(lf)
<class 'polars.lazyframe.frame.LazyFrame'>

The schema is accessible without loading data:

# inspect column names and types without reading the file
lf.collect_schema().names()[:8]
['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate', 'Reporting_Airline', 'DOT_ID_Reporting_Airline']

4.2 Why laziness makes queries faster

By seeing the whole pipeline before running anything, Polars can rearrange and trim work that would otherwise be wasted. The two most impactful optimisations — predicate pushdown and projection pushdown — both reduce I/O at the scan level. Other optimisations include slice pushdown, common subplan elimination, constant folding, join ordering, type coercion, and cardinality estimation (full list).

4.2.1 Predicate and projection pushdown

Deferring execution lets Polars apply two key optimisations at the file-scan level:

  • Predicate pushdown: filters are moved as early as possible — ideally into the file scan itself. Instead of reading all 537,000 rows and then discarding 95% of them, only the rows that satisfy the predicate are ever loaded.
  • Projection pushdown: only the columns referenced in the query are read from disk. The remaining columns are never decompressed or parsed.

To see the optimiser in action, compare the non-optimised and optimised versions of the same query plan:

q = (
    pl.scan_csv(CSV)
    .filter(pl.col('OriginStateName') == 'Ohio')
    .group_by('Reporting_Airline')
    .agg(pl.col('DepDelay').mean().alias('mean_delay'))
    .sort('mean_delay', descending=True)
)

# non-optimised plan: operations appear in the order they were written
print(q.explain(optimized=False))
SORT BY [descending: [true]] [col("mean_delay")]
  AGGREGATE[maintain_order: false]
    [col("DepDelay").mean().alias("mean_delay")] BY [col("Reporting_Airline")]
    FROM
    FILTER [(col("OriginStateName")) == ("Ohio")]
    FROM
      Csv SCAN [./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv]
      PROJECT */110 COLUMNS
      ESTIMATED ROWS: 547857

Without optimisation, the plan mirrors the original code structure: scan all 110 columns, then filter, then group, then sort. Every column is loaded from disk, and the filter runs after the full CSV is in memory.

# optimised plan: the optimiser has rewritten the plan for efficiency
print(q.explain())
SORT BY [descending: [true]] [col("mean_delay")]
  AGGREGATE[maintain_order: false]
    [col("DepDelay").mean().alias("mean_delay")] BY [col("Reporting_Airline")]
    FROM
    simple π 2/2 ["DepDelay", ... 1 other column]
      Csv SCAN [./data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv]
      PROJECT 3/110 COLUMNS
      SELECTION: [(col("OriginStateName")) == ("Ohio")]
      ESTIMATED ROWS: 547857

The optimised plan tells a different story:

  • PROJECT 3/110 COLUMNS — only the 3 columns needed by the rest of the pipeline are read from disk (projection pushdown)
  • SELECTION: [(col("OriginStateName")) == ("Ohio")] — the filter is pushed into the scan, so rows that don’t match are never loaded (predicate pushdown)

4.2.2 Triggering execution with collect()

A LazyFrame computes nothing until you call .collect(), which executes the optimised plan and returns a DataFrame:

result = q.collect()
result
shape: (15, 2)
Reporting_Airline mean_delay
str f64
"B6" 41.673077
"YV" 30.794643
"AS" 24.833333
"F9" 22.518634
"OO" 19.089655
"WN" 8.159069
"UA" 7.806041
"9E" 6.43956
"YX" 4.939932
"DL" 4.83871

The full pipeline — scan, filter, group-by, aggregate, sort — executes as a single optimised pass.

4.2.2.1 Eager vs lazy: measured speedup

Running the same filter + group-by + aggregate pipeline in both modes on the flights dataset:

import time

start = time.perf_counter()
eager_result = (
    pl.read_csv(CSV)
    .filter(pl.col('OriginStateName') == 'Ohio')
    .group_by('Reporting_Airline')
    .agg(pl.col('DepDelay').mean().alias('mean_delay'))
)
eager_time = time.perf_counter() - start

start = time.perf_counter()
lazy_result = (
    pl.scan_csv(CSV)
    .filter(pl.col('OriginStateName') == 'Ohio')
    .group_by('Reporting_Airline')
    .agg(pl.col('DepDelay').mean().alias('mean_delay'))
    .collect()
)
lazy_time = time.perf_counter() - start

print(f"Eager:   {eager_time:.3f}s")
Eager:   0.497s
print(f"Lazy:    {lazy_time:.3f}s")
Lazy:    0.129s
print(f"Speedup: {eager_time / lazy_time:.1f}x")
Speedup: 3.9x

The lazy pipeline is roughly 4× faster on this 537k-row file — and the advantage grows with file size and pipeline complexity, because the optimiser has more to work with.

4.3 Comparison with R / tidyverse

The closest mental model for R users is dbplyr: a pipeline of dplyr verbs built against a database connection does nothing until collect() is called, at which point the whole chain is translated to SQL and sent to the database — and it is the database engine that optimises and executes the query. Polars lazy mode follows the same deferred pattern, but the optimiser is Polars itself rather than an external database: no SQL translation, no network round-trip, just Polars rewriting and executing the plan locally against your files.

The key difference is that dplyr itself (against in-memory data frames) is eager: each verb executes immediately and returns a result. The same is true for data.table. Lazy optimisation for in-memory and file-based data in R is newer territory.

vectra is a recent addition to the R ecosystem that fills exactly this gap — a columnar query engine providing dplyr-style lazy execution on its own binary format (.vtr). Like Polars, it defers computation until collect(), applies predicate pushdown and column pruning at the scan level, and processes data in fixed-size batches with constant memory:

# vectra: lazy scan of a .vtr file with pushdown (same dataset as Polars example)
tm <- system.time({
  result <- tbl("data/ontime_2022_1.vtr") |>
      filter(OriginStateName == "Ohio") |>
      group_by(Reporting_Airline) |>
      summarise(mean_delay = mean(DepDelay, na.rm = TRUE)) |>
      collect()
})
print(result)
   Reporting_Airline mean_delay
1                 9E   6.439560
2                 AA  17.898406
3                 AS  24.833333
4                 B6  41.673077
5                 DL   4.838710
6                 F9  22.518634
7                 G4  14.244444
8                 MQ   9.456989
9                 NK  13.851351
10                OH  14.047386
11                OO  19.089655
12                UA   7.806041
13                WN   8.159069
14                YV  30.794643
15                YX   4.939932
cat(sprintf("vectra lazy:  %.3fs\n", tm["elapsed"]))
vectra lazy:  0.169s
# examine the optimised query plan
tbl("data/ontime_2022_1.vtr") |>
    filter(OriginStateName == "Ohio") |>
    select(Reporting_Airline, DepDelay) |>
    explain()
vectra execution plan

ProjectNode [streaming] 
  FilterNode [streaming] 
    ScanNode [streaming, 3/110 cols (pruned), predicate pushdown, tdc stats] 

Output columns (2):
  Reporting_Airline <string>
  DepDelay <double>

4.4 When to use the lazy API

Use lazy (scan + collect) when:

  • Reading from files (CSV, Parquet, JSON) — pushdown avoids reading unnecessary rows and columns from disk
  • Running multi-step pipelines (filter → select → group_by → agg → sort) — the optimiser composes the rewrites across the whole chain
  • Working with datasets that approach or exceed available RAM
  • Production pipelines where throughput matters

Use eager (read + DataFrame) when:

  • Doing interactive exploration where you want to inspect intermediate results
  • Working with small datasets where the optimisation overhead is irrelevant
  • Using operations that are eager-only, such as pivot(), which requires knowing the full schema before it can determine output column names

The recommended pattern is to start lazy, accumulate transformations, and call .collect() exactly once at the end:

# recommended: one scan, one collect
result = (
    pl.scan_csv(CSV)
    .filter(pl.col('Cancelled') == 0)
    .with_columns(
        pl.col('FlightDate').str.to_date('%Y-%m-%d'),
        pl.col('DepDelay').fill_null(0)
    )
    .group_by('Reporting_Airline')
    .agg(
        n_flights  = pl.len(),
        mean_delay = pl.col('DepDelay').mean()
    )
    .sort('mean_delay', descending=True)
    .collect()
)

If you need to pass through an eager-only operation mid-pipeline, collect, apply it, then call .lazy() to resume lazy execution:

# pivot is eager-only; collect → pivot → lazy to continue
result = (
    pl.scan_csv(CSV)
    .group_by(['Reporting_Airline', 'DayOfWeek'])
    .agg(n_flights=pl.len())
    .collect()                          # materialise for pivot
    .pivot(on='DayOfWeek', index='Reporting_Airline', values='n_flights', aggregate_function=None)
    .lazy()                             # resume lazy for further steps
    .sort('Reporting_Airline')
    .collect()
)

4.5 Converting between eager and lazy

Any DataFrame can be converted to a LazyFrame with .lazy(), and any LazyFrame to a DataFrame with .collect():

# DataFrame → LazyFrame
df = pl.read_csv(CSV)
lf = df.lazy()

# LazyFrame → DataFrame
df_again = lf.collect()
print(type(lf), "→", type(df_again))
<class 'polars.lazyframe.frame.LazyFrame'> → <class 'polars.dataframe.frame.DataFrame'>

This makes it easy to mix lazy and eager code: start eager for exploration, switch to lazy for the final pipeline, or vice versa.