Pandas GroupBy Deep
Deep dive · part of Python Pandas Basics
groupby implements the split-apply-combine pattern. You can aggregate (sum, mean), transform (return same shape), filter (drop groups) and apply arbitrary functions. Multiple aggregations at once use agg with a dict or list.
groupby implements split-apply-combine: partition rows by keys, apply aggregation/transform/filter, combine results. agg with named aggregations produces clear column names; transform returns same-length Series aligned to original index.
Multi-index groupby handles hierarchical keys; observed=True skips empty categorical combinations in 1.5+.
Analysts live in groupby for cohort metrics, funnel breakdowns, and experiment analysis—fluent agg and transform separate juniors from productive contributors.
Production code combines this topic with logging, tests, and clear module boundaries so refactors stay safe when requirements grow.
agg(mean), agg({'col': 'sum'}) and agg(avg=('col','mean')) styles.
transform computes group-wise stats per row (z-scores, fills).
filter(lambda g: len(g) > 5) drops whole groups failing predicate.
apply on groups flexible but slower—prefer vectorized agg.
as_index=False keeps group keys as columns after reset.
Grouper objects time-group on datetime columns.
Named aggregation (pandas 0.25+) clarifies output schema for pipelines. Categorical dtypes speed groupby when cardinality modest.
Watch NA groups: NaN keys form their own group unless dropna=False.
Flatten MultiIndex columns after agg for SQL export—downstream tools need stable flat names without ambiguous levels.
Flatten MultiIndex columns after named agg so BI tools and CSV exports get stable headers.
Read the parent tutorial on pythondeck.com for runnable snippets, then reproduce them locally in a virtual environment with pinned dependency versions matching your deployment target.
When pairing with teammates, agree on one idiomatic pattern per concern—mixed styles in one repo slow reviews and invite subtle integration bugs during merges.
apply(lambda g: ...) row-wise Python loops on huge groups—slow.
Forgetting transform returns original index order requirements.
Aggregating object columns with mean—dtype errors.
Chaining groupby without reset_index leading to ambiguous index.
Use named agg in production ETL for stable column names.
Filter invalid groups early to shrink data.
Validate group sizes with value_counts on keys.
Export agg results to parquet with explicit schema.
Print df.groupby(keys).size() before heavy agg to catch accidental cartesian keys.
Re-read the examples below with these ideas in mind; change variable names and inputs to match your own project.
The program below demonstrates multi-aggregation. Read the comments on each line, run the code, then change names or values to see how the output shifts.
# Example: Multi-aggregation
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd
df = pd.DataFrame({
"dept": ["R&D","R&D","OPS","OPS","R&D"],
"salary":[100, 120, 80, 90, 130],
"age": [30, 45, 25, 40, 50],
})
print(df.groupby("dept").agg(
avg_salary=("salary", "mean"),
total=("salary", "sum"),
n=("salary", "count"),
))
This sample walks through transform (z-score per group) in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# Example: Transform (z-score per group)
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd
df = pd.DataFrame({
"team": ["A","A","A","B","B","B"],
"score":[1, 2, 3, 10, 20, 30],
})
df["z"] = df.groupby("team")["score"].transform(
lambda s: (s - s.mean()) / s.std())
print(df)
Here is a hands-on illustration of filter small groups. Follow the inline comments first; only then execute the snippet and compare the result with what you expected.
# Example: Filter small groups
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd
df = pd.DataFrame({"k": list("aaabbcccdd"), "v": range(10)})
big = df.groupby("k").filter(lambda g: len(g) >= 3)
print(big)
The program below demonstrates agg multiple. Read the comments on each line, run the code, then change names or values to see how the output shifts.
# agg applies several reductions per column
import pandas as pd # pandas
df = pd.DataFrame({"dept": ["R&D", "R&D", "OPS"], "salary": [100, 120, 80]}) # data
out = df.groupby("dept").agg(avg=("salary", "mean"), total=("salary", "sum")) # named aggs
print(out) # per-dept stats
print(out.loc["R&D", "avg"]) # extract one cell
print(out.columns.tolist()) # column MultiIndex flattened
print(len(out)) # number of groups
This sample walks through transform zscore in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# transform returns same shape as input — per-group z-score
import pandas as pd # pandas
df = pd.DataFrame({"g": ["A", "A", "B", "B"], "v": [1, 3, 10, 20]}) # groups
df["z"] = df.groupby("g")["v"].transform(lambda s: (s - s.mean()) / s.std()) # z
print(df) # includes z column
print(df.groupby("g")["z"].mean().round(3)) # group means of z ~ 0
print(df.shape) # same row count as input
print(df["z"].tolist()) # z column values
Related deep dives on Python Pandas Basics: