Python Pandas Basics

Tutorial 47 of 65 · pythondeck.com Python course

pandas provides labelled tabular data through two types: Series (one-dimensional) and DataFrame (two-dimensional). It is built on NumPy, integrates with Matplotlib, and is well suited to data cleaning, joins, aggregation, time series, ETL, and reporting. Missing values, time zones, and group-by aggregations are first-class.

Solid pandas basics save hours when cleaning CSV exports, joining logs, or preparing features for models.

DataFrame / Series — index + columns; alignment on labels during arithmetic.

read_csv / to_parquet — ingest and persist; mind dtypes and parse_dates.

Selection — loc label-based, iloc position-based; avoid chained assignment.

missing data — isna, fillna, dropna; know when imputation biases analysis.

groupby — split-apply-combine; agg with named aggregations for readable summaries.

merge / concat — SQL-like joins; validate row counts and duplicate keys.
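The concepts listed above can be tried together in one short script. This is a minimal sketch, not a canonical workflow: the CSV text, the column names (day, city, amount) and the choice to fill gaps with 0 are all assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real export file.
csv_text = (
    "day,city,amount\n"
    "2024-01-01,NY,10\n"
    "2024-01-02,NY,\n"
    "2024-01-01,LA,5\n"
    "2024-01-02,LA,7\n"
)
# parse_dates turns the day column into datetimes instead of plain strings.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

# Selection: loc uses labels, iloc uses integer positions.
first_city = sales.loc[0, "city"]
last_amount = sales.iloc[-1, 2]

# Missing data: count the gaps, then fill them (0 is an arbitrary choice here).
n_missing = int(sales["amount"].isna().sum())
sales["amount"] = sales["amount"].fillna(0)

# groupby with named aggregations: output columns get explicit names.
summary = sales.groupby("city").agg(
    total=("amount", "sum"),
    days=("day", "nunique"),
)

# Arithmetic aligns on index labels; labels found on one side only give NaN.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b

print(summary)
print(aligned)
```

Filling with 0 changes the NY total and would change a mean even more, which is the kind of imputation bias the list warns about.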

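Copy-on-write semantics (opt-in in pandas 2.x, planned as the default in 3.0) reduce accidental copies, but chained indexing still bites: an assignment made through two indexing steps lands on a temporary object and never reaches the original frame. Prefer df.loc[mask, 'col'] = value for updates. Categorical dtypes shrink memory for repeated strings. Datetimes should be timezone-aware (UTC) before aggregating global events.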
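A small sketch of the three recommendations in the paragraph above. The frame contents and column names (a, b, label, ts) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [-1, 2, 3],
    "b": [0, 0, 0],
    "label": ["red", "blue", "red"],
})

# Single-step update: the row mask and the column go into one .loc call.
mask = df["a"] > 0
df.loc[mask, "b"] = 1

# A categorical column stores each distinct string once plus integer codes.
df["label"] = df["label"].astype("category")

# Two timestamps written with different offsets; utc=True converts both
# to timezone-aware UTC so they can be compared and grouped consistently.
events = pd.DataFrame({
    "ts": ["2024-01-01 23:30:00+00:00", "2024-01-02 00:30:00+01:00"],
})
events["ts"] = pd.to_datetime(events["ts"], utc=True)

print(df)
print(events)
```

The two events look like they fall on different calendar days in their local notation, yet they are the same instant once converted to UTC. That is why the conversion should happen before any per-day aggregation.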

For larger-than-RAM data, chunk read_csv, use PyArrow-backed dtypes, or move to Polars/DuckDB—but pandas remains the lingua franca for exploratory work.
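Chunked reading can be sketched as follows. The in-memory CSV text and the chunk size of 2 are assumptions chosen to keep the example tiny; a real file would use a path and a much larger chunksize.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file too large to load at once.
csv_text = "city,sales\nNY,10\nNY,20\nLA,5\nLA,7\nSF,3\n"

totals = {}
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Aggregate each chunk on its own, then fold it into the running totals.
    partial = chunk.groupby("city")["sales"].sum()
    for city, value in partial.items():
        totals[city] = totals.get(city, 0) + int(value)

print(totals)
```

This pattern only works for aggregations that can be combined from partial results, such as sums and counts. A median, for example, cannot be computed this way.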

Chained indexing like df[df['a'] > 0]['b'] = 1, which writes to a temporary copy and triggers SettingWithCopyWarning.

Applying lambda row-wise (axis=1) on millions of rows.

Joining without checking duplicate keys, silently multiplying rows.

Leaving dates as object dtype instead of parsing them to datetime64[ns, UTC].
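Two of the pitfalls above are easy to reproduce on a toy frame. This is a sketch with invented tables (orders, customers) and invented column names. It shows a row-wise apply next to its vectorized equivalent, and a join whose duplicate key multiplies rows.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2],
    "qty": [2, 3],
    "price": [5.0, 4.0],
})

# Row-wise apply calls a Python function once per row...
slow = orders.apply(lambda row: row["qty"] * row["price"], axis=1)
# ...while column arithmetic gives the same values in one vectorized step.
fast = orders["qty"] * orders["price"]

# customer_id 1 appears twice on the right side, so its order is duplicated.
customers = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "name": ["Ada", "Ada L.", "Grace"],
})
joined = orders.merge(customers, on="customer_id", how="left")
print(len(orders), len(joined))

# validate= turns the silent row multiplication into an error.
try:
    orders.merge(customers, on="customer_id", how="left", validate="one_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True
print(caught)
```

On two rows the speed difference between slow and fast is invisible; it becomes significant at the millions of rows mentioned in the list.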

Inspect df.info(), describe(), and value_counts after every load.

Use method chaining with pipe for readable ETL steps.

Assert row counts after merges; keep a _merge indicator when debugging.

Export stable parquet with explicit dtypes for downstream pipelines.

Write assert len(df) checks after filter steps that must not drop rows silently.
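Several of the habits above fit into one short pipeline. The helper functions drop_missing_scores and add_passed, the threshold of 90, and both tables are hypothetical and exist only for this sketch.

```python
import pandas as pd


def drop_missing_scores(df):
    # Hypothetical step: remove rows that have no score.
    return df.dropna(subset=["score"])


def add_passed(df, threshold):
    # Hypothetical step: flag rows at or above the threshold.
    return df.assign(passed=df["score"] >= threshold)


raw = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "score": [99.0, None, 85.0],
})
teams = pd.DataFrame({"name": ["Ada", "Linus"], "team": ["A", "B"]})

# pipe passes the frame through each step in reading order.
clean = raw.pipe(drop_missing_scores).pipe(add_passed, threshold=90)

# indicator=True adds a _merge column: "both", "left_only" or "right_only".
merged = clean.merge(teams, on="name", how="left", indicator=True)

# A left join on a unique key must keep the row count unchanged.
assert len(merged) == len(clean)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked left_only in _merge are the ones that found no partner in the lookup table, which is usually the first thing to check when a join looks wrong.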

Re-read the examples below with these ideas in mind; change variable names and inputs to match your own project.

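The program below builds a small DataFrame and prints summary statistics for it. Read the comments, run the code, then change names or values to see how the output shifts.

# Example: DataFrame
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd

# Build a DataFrame from a dict: keys become column names, lists become columns.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "score": [99, 97, 85],
})
print(df)             # whole table with the default integer index
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns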

This sample filters rows with a boolean mask and then sorts the result, in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.

# Example: Filter / sort
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [99, 70, 85]})
# Keep rows with score above 80, then order them from highest to lowest.
high = df[df["score"] > 80]
print(high.sort_values("score", ascending=False))

Here is a hands-on illustration of groupby. Follow the inline comments first; only then execute the snippet and compare the result with what you expected.

# Example: Group by
# Run in the REPL or save as a .py file and execute with python.
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA"],
    "sales": [10, 20, 5, 7],
})
# Split rows by city, sum each group, and combine into one row per city.
print(df.groupby("city").sum())

The program below demonstrates DataFrame I/O: it builds a frame, adds a column, writes the result to CSV, and reads it back. Read the comments on each line, run the code, then change names or values to see how the output shifts.

# A pandas DataFrame is a columnar table with labeled axes
import pandas as pd  # tabular toolkit
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [99, 97]})  # build
print(df.head())  # first rows
print(df["score"].mean())  # column stats
df["passed"] = df["score"] >= 98  # new boolean column
print(df[df["passed"]])  # filter rows
df.to_csv("scores.csv", index=False)  # export CSV
back = pd.read_csv("scores.csv")  # reload
print(back.columns.tolist())  # column names

This sample walks through common Series operations in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.

# A Series is a 1-D labeled array, the building block of a DataFrame
import pandas as pd  # pandas
s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # labeled
print(s.loc["b"])  # label-based access -> 20
print(s.iloc[0])  # position access -> 10
print(s + 5)  # vectorized add
print(s.describe())  # count/mean/std/min/quartiles/max
rolled = s.rolling(2).mean()  # moving average over 2 values
print(rolled)  # first value is NaN because its window is incomplete
