Polars Expressions & Contexts: The Iris Edition

A high-density guide to Polars’ expression engine, showcasing complex transformations across all execution contexts.
Tags: Python, Data Science, Polars
Author: bwrob
Published: April 5, 2026

Polars separates Expressions (the what) from Contexts (the where). This architecture allows the engine to optimize queries globally before execution.

1. Loading the Data

import polars as pl
import polars.selectors as cs
import seaborn as sns
from datetime import datetime, timedelta

# Load from seaborn and convert
df = pl.from_pandas(sns.load_dataset('iris'))
df.head(3)
shape: (3, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"

2. Expressions: The Building Blocks

Expressions are the atomic units of logic. They are lazily evaluated and can be chained indefinitely.

# Complex logic: normalization, clipping, and casting
expr = (
    (pl.col("sepal_length") - pl.col("sepal_length").mean())
    / pl.col("sepal_length").std()
).clip(-1, 1).cast(pl.Float32)

# Column-level string/date operations
species_upper = pl.col("species").str.to_uppercase()

3. Power Selectors (polars.selectors)

Selectors target columns based on their properties, types, or names.

# Target numeric columns that contain 'width'
df.select(cs.numeric() & cs.contains("width"))

# Apply a transformation to all numeric columns except 'sepal_length'
df.select(
    (cs.numeric() - cs.by_name("sepal_length")) * 100
).head(3)
shape: (3, 3)
sepal_width petal_length petal_width
f64 f64 f64
350.0 140.0 20.0
300.0 140.0 20.0
320.0 130.0 20.0

4. Context: Selection (select)

In the select context, expressions must return a result of the same length or a scalar (which is broadcasted).

df.select(
    # Simple selection
    "species",
    # Multiple aggregations as scalars (broadcasted)
    mean_petal = pl.col("petal_length").mean(),
    # Sorting one column based on another within select
    sorted_sepal = pl.col("sepal_length").sort_by("petal_width")
).head(3)
shape: (3, 3)
species mean_petal sorted_sepal
str f64 f64
"setosa" 3.758 4.9
"setosa" 3.758 4.8
"setosa" 3.758 4.3

5. Context: Modification (with_columns)

Used to append new columns or overwrite existing ones while keeping the rest of the DataFrame.

df.with_columns(
    # Complex boolean logic
    is_large = (pl.col("sepal_length") > 5) & (pl.col("petal_length") > 1.5),
    # Row-wise maximum across numeric columns
    max_dim = pl.max_horizontal(cs.numeric()),
    # Renaming on the fly
    sepal_sum = pl.col("sepal_length") + pl.col("sepal_width")
).head(3)
shape: (3, 8)
sepal_length sepal_width petal_length petal_width species is_large max_dim sepal_sum
f64 f64 f64 f64 str bool f64 f64
5.1 3.5 1.4 0.2 "setosa" false 5.1 8.6
4.9 3.0 1.4 0.2 "setosa" false 4.9 7.9
4.7 3.2 1.3 0.2 "setosa" false 4.7 7.9

6. Context: Aggregation (group_by)

Expressions in agg are evaluated independently for each group.

df.group_by("species").agg(
    # Count rows
    pl.len().alias("count"),
    # Multiple aggregations on multiple columns via selectors
    cs.numeric().mean().name.prefix("avg_"),
    # Aggregation into a list (in agg, a bare column already collects into a list)
    petal_lengths = pl.col("petal_length"),
    # Complex: Mean of top 3 values per group
    top_3_mean = pl.col("sepal_length").sort(descending=True).head(3).mean()
)
shape: (3, 8)
species count avg_sepal_length avg_sepal_width avg_petal_length avg_petal_width petal_lengths top_3_mean
str u32 f64 f64 f64 f64 list[f64] f64
"versicolor" 50 5.936 2.77 4.26 1.326 [4.7, 4.5, … 4.1] 6.9
"setosa" 50 5.006 3.428 1.462 0.246 [1.4, 1.4, … 1.4] 5.733333
"virginica" 50 6.588 2.974 5.552 2.026 [6.0, 5.1, … 5.1] 7.766667

7. Window Functions (over)

Window functions allow you to perform group-level calculations without collapsing the rows.

df.with_columns(
    # Rank within species based on petal length
    rank_in_species = pl.col("petal_length").rank("dense", descending=True).over("species"),
    # Diff from group mean
    diff_from_group = pl.col("sepal_length") - pl.col("sepal_length").mean().over("species"),
    # Cumulative sum within species
    cum_sum = pl.col("sepal_length").cum_sum().over("species")
).head(3)
shape: (3, 8)
sepal_length sepal_width petal_length petal_width species rank_in_species diff_from_group cum_sum
f64 f64 f64 f64 str u32 f64 f64
5.1 3.5 1.4 0.2 "setosa" 5 0.094 5.1
4.9 3.0 1.4 0.2 "setosa" 5 -0.106 10.0
4.7 3.2 1.3 0.2 "setosa" 6 -0.306 14.7

8. Complex Types: Lists & Structs

Polars has first-class support for nested data.

# Create a struct (row-wise grouping of columns)
df_struct = df.with_columns(
    dims = pl.struct("sepal_length", "sepal_width")
)

# Extracting from a struct
df_struct.select(
    pl.col("dims").struct.field("sepal_length")
).head(3)

# List mastery: eval() allows running full expressions inside a list column
df.group_by("species").agg(
    pl.col("petal_length")  # agg already collects each group into a list
).with_columns(
    # For every list: sort, take top 2, and mean
    top_2_avg = pl.col("petal_length").list.eval(
        pl.element().sort(descending=True).head(2).mean()
    )
)
shape: (3, 3)
species petal_length top_2_avg
str list[f64] list[f64]
"setosa" [1.4, 1.4, … 1.4] [1.9]
"virginica" [6.0, 5.1, … 5.1] [6.8]
"versicolor" [4.7, 4.5, … 4.1] [5.05]

9. Time Series Mastery: asof & Rolling

Polars is blazingly fast for time series: rolling windows and as-of joins operate directly on sorted datetime columns, no index required.

# Setup time data
ts_df = df.with_columns(
    time = pl.datetime_range(datetime(2023, 1, 1), datetime(2023, 1, 1) + timedelta(days=149), "1d", eager=True)
)

# Rolling window: 3-day mean of petal length
ts_df.with_columns(
    rolling_mean = pl.col("petal_length").rolling_mean_by(window_size="3d", by="time")
).head(5)

# Join Asof: join on closest match (backwards/forward/nearest)
quotes = pl.DataFrame({
    "time": [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5)],
    "price": [100.0, 101.0]
})
trades = pl.DataFrame({"time": [datetime(2023, 1, 1, 10, 2)]})

trades.join_asof(quotes, on="time", strategy="backward")
shape: (1, 2)
time price
datetime[μs] f64
2023-01-01 10:02:00 100.0

10. Reshaping: pivot and unpivot

Convert data between wide and long formats.

# Wide to Long (unpivot/melt)
long_df = df.unpivot(index=["species"], on=cs.numeric())

# Long to Wide (pivot)
long_df.pivot(on="variable", values="value", index="species", aggregate_function="mean")
shape: (3, 5)
species sepal_length sepal_width petal_length petal_width
str f64 f64 f64 f64
"setosa" 5.006 3.428 1.462 0.246
"versicolor" 5.936 2.77 4.26 1.326
"virginica" 6.588 2.974 5.552 2.026

11. Set Operations: semi and anti joins

Filter one DataFrame by another without duplicating rows or merging columns.

# Only Setosa (Semi join acts as a filter)
species_filter = pl.DataFrame({"species": ["setosa"]})
df.join(species_filter, on="species", how="semi").head(3)

# Everything BUT Setosa (Anti join)
df.join(species_filter, on="species", how="anti").head(3)
shape: (3, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
7.0 3.2 4.7 1.4 "versicolor"
6.4 3.2 4.5 1.5 "versicolor"
6.9 3.1 4.9 1.5 "versicolor"

12. Optimization: Categorical vs Enum

Low-cardinality strings should almost always be encoded. Enum takes a fixed category set up front and is faster; Categorical builds its mapping dynamically as new values arrive.

# Use Enum for fixed sets (Species is perfect)
species_enum = pl.Enum(["setosa", "versicolor", "virginica"])
df.with_columns(pl.col("species").cast(species_enum)).head(3)
shape: (3, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 enum
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"

13. Missing Data: Interpolate & Coalesce

# Coalesce: first non-null value from multiple columns
messy = pl.DataFrame({"a": [1, None, None], "b": [None, 2, None], "c": [None, None, 3]})
messy.select(filled = pl.coalesce("a", "b", "c"))

# Interpolate: linear fill for missing numeric data
df_missing = pl.DataFrame({"val": [1.0, None, 3.0]})
df_missing.select(pl.col("val").interpolate())
shape: (3, 1)
val
f64
1.0
2.0
3.0

14. Concatenation: pl.concat

Combine DataFrames vertically, horizontally, or diagonally.

df1 = df.head(2)
df2 = df.tail(2)

# Vertical (stacking)
pl.concat([df1, df2], how="vertical")

# Horizontal (side-by-side)
pl.concat([df1.select("species"), df2.select(cs.numeric())], how="horizontal")
shape: (2, 5)
species sepal_length sepal_width petal_length petal_width
str f64 f64 f64 f64
"setosa" 6.2 3.4 5.4 2.3
"setosa" 5.9 3.0 5.1 1.8

15. Out-of-Core & Partitioning

# Streaming flow for massive data
(
    pl.scan_csv("huge.csv")
    .filter(pl.col("val") > 0)
    .sink_csv("output.csv")
)

# Partitioned writing (Data Lake pattern)
# Writes files into folders like: output/species=setosa/data.parquet
df.write_parquet("output_lake", use_pyarrow=True, pyarrow_options={"partition_cols": ["species"]})
Note: Download the companion script here.
