| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
Polars separates Expressions (the what) from Contexts (the where). This architecture allows the engine to optimize queries globally before execution.
1. Loading the Data
2. Expressions: The Building Blocks
Expressions are the atomic units of logic. They are lazily evaluated and can be chained indefinitely.
3. Power Selectors (polars.selectors)
Selectors target columns based on their properties, types, or names.
| sepal_width | petal_length | petal_width |
|---|---|---|
| f64 | f64 | f64 |
| 350.0 | 140.0 | 20.0 |
| 300.0 | 140.0 | 20.0 |
| 320.0 | 130.0 | 20.0 |
4. Context: Selection (select)
In the select context, every expression must return either a result of the same length or a scalar (which is broadcast to match).
| species | mean_petal | sorted_sepal |
|---|---|---|
| str | f64 | f64 |
| "setosa" | 3.758 | 4.9 |
| "setosa" | 3.758 | 4.8 |
| "setosa" | 3.758 | 4.3 |
5. Context: Modification (with_columns)
Used to append new columns or overwrite existing ones while keeping the rest of the DataFrame.
| sepal_length | sepal_width | petal_length | petal_width | species | is_large | max_dim | sepal_sum |
|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str | bool | f64 | f64 |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" | false | 5.1 | 8.6 |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" | false | 4.9 | 7.9 |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" | false | 4.7 | 7.9 |
6. Context: Aggregation (group_by)
Expressions in agg are evaluated independently for each group.
```python
import polars.selectors as cs

df.group_by("species").agg(
    # Count rows
    pl.len().alias("count"),
    # Multiple aggregations on multiple columns via selectors
    cs.numeric().mean().name.prefix("avg_"),
    # Aggregation into a list (very powerful)
    petal_lengths = pl.col("petal_length").implode(),
    # Complex: mean of the top 3 values per group
    top_3_mean = pl.col("sepal_length").sort(descending=True).head(3).mean(),
)
```

| species | count | avg_sepal_length | avg_sepal_width | avg_petal_length | avg_petal_width | petal_lengths | top_3_mean |
|---|---|---|---|---|---|---|---|
| str | u32 | f64 | f64 | f64 | f64 | list[f64] | f64 |
| "versicolor" | 50 | 5.936 | 2.77 | 4.26 | 1.326 | [4.7, 4.5, … 4.1] | 6.9 |
| "setosa" | 50 | 5.006 | 3.428 | 1.462 | 0.246 | [1.4, 1.4, … 1.4] | 5.733333 |
| "virginica" | 50 | 6.588 | 2.974 | 5.552 | 2.026 | [6.0, 5.1, … 5.1] | 7.766667 |
7. Window Functions (over)
Window functions allow you to perform group-level calculations without collapsing the rows.
```python
df.with_columns(
    # Rank within species based on petal length
    rank_in_species = pl.col("petal_length").rank("dense", descending=True).over("species"),
    # Difference from the group mean
    diff_from_group = pl.col("sepal_length") - pl.col("sepal_length").mean().over("species"),
    # Cumulative sum within species
    cum_sum = pl.col("sepal_length").cum_sum().over("species"),
).head(3)
```

| sepal_length | sepal_width | petal_length | petal_width | species | rank_in_species | diff_from_group | cum_sum |
|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str | u32 | f64 | f64 |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" | 5 | 0.094 | 5.1 |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" | 5 | -0.106 | 10.0 |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" | 6 | -0.306 | 14.7 |
8. Complex Types: Lists & Structs
Polars has first-class support for nested data.
```python
# Create a struct (a row-wise grouping of columns)
df_struct = df.with_columns(
    dims = pl.struct("sepal_length", "sepal_width")
)

# Extract a field from the struct
df_struct.select(
    pl.col("dims").struct.field("sepal_length")
).head(3)
```
```python
# List mastery: list.eval() runs a full expression against each list
df.group_by("species").agg(
    pl.col("petal_length").implode()
).with_columns(
    # For every list: sort descending, take the top 2, and average them
    top_2_avg = pl.col("petal_length").list.eval(
        pl.element().sort(descending=True).head(2).mean()
    )
)
```

| species | petal_length | top_2_avg |
|---|---|---|
| str | list[f64] | list[f64] |
| "setosa" | [1.4, 1.4, … 1.4] | [1.9] |
| "virginica" | [6.0, 5.1, … 5.1] | [6.8] |
| "versicolor" | [4.7, 4.5, … 4.1] | [5.05] |
9. Time Series Mastery: asof & Rolling
Polars is extremely fast at time-series operations.
```python
from datetime import datetime, timedelta

# Set up time data: one timestamp per row, one day apart
ts_df = df.with_columns(
    time = pl.datetime_range(
        datetime(2023, 1, 1),
        datetime(2023, 1, 1) + timedelta(days=149),
        "1d",
        eager=True,
    )
)

# Rolling window: 3-day mean of petal length
ts_df.with_columns(
    rolling_mean = pl.col("petal_length").rolling_mean_by(window_size="3d", by="time")
).head(5)
```
```python
# join_asof: join on the closest match (strategy: backward / forward / nearest)
quotes = pl.DataFrame({
    "time": [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5)],
    "price": [100.0, 101.0],
})
trades = pl.DataFrame({"time": [datetime(2023, 1, 1, 10, 2)]})

trades.join_asof(quotes, on="time", strategy="backward")
```

| time | price |
|---|---|
| datetime[μs] | f64 |
| 2023-01-01 10:02:00 | 100.0 |
10. Reshaping: pivot and unpivot
Convert data between wide and long formats.
| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "setosa" | 5.006 | 3.428 | 1.462 | 0.246 |
| "versicolor" | 5.936 | 2.77 | 4.26 | 1.326 |
| "virginica" | 6.588 | 2.974 | 5.552 | 2.026 |
11. Set Operations: semi and anti joins
Filter one DataFrame by another without duplicating rows or merging columns.
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str |
| 7.0 | 3.2 | 4.7 | 1.4 | "versicolor" |
| 6.4 | 3.2 | 4.5 | 1.5 | "versicolor" |
| 6.9 | 3.1 | 4.9 | 1.5 | "versicolor" |
12. Optimization: Categorical vs Enum
Low-cardinality string columns should almost always be encoded. Enum requires the full category set up front and is faster; Categorical discovers its categories dynamically as data arrives.
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | enum |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
13. Missing Data: Interpolate & Coalesce
```python
# coalesce: take the first non-null value across several columns
messy = pl.DataFrame({"a": [1, None, None], "b": [None, 2, None], "c": [None, None, 3]})
messy.select(filled = pl.coalesce("a", "b", "c"))

# interpolate: linear fill for missing numeric data
df_missing = pl.DataFrame({"val": [1.0, None, 3.0]})
df_missing.select(pl.col("val").interpolate())
```

| val |
|---|
| f64 |
| 1.0 |
| 2.0 |
| 3.0 |
14. Concatenation: pl.concat
Combine DataFrames vertically, horizontally, or diagonally.
| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 |
| "setosa" | 6.2 | 3.4 | 5.4 | 2.3 |
| "setosa" | 5.9 | 3.0 | 5.1 | 1.8 |
15. Out-of-Core & Partitioning
```python
# Streaming flow for massive data: scan lazily, sink without materializing
(
    pl.scan_csv("huge.csv")
    .filter(pl.col("val") > 0)
    .sink_csv("output.csv")
)

# Partitioned writing (data-lake pattern)
# Writes files into folders like: output_lake/species=setosa/data.parquet
df.write_parquet("output_lake", use_pyarrow=True,
                 pyarrow_options={"partition_cols": ["species"]})
```

Download the companion script here.