🧹 Module 3

Data Wrangling with Pandas

⏱ 14 hoursIntermediate6 topics

🎯 By the end: select and filter data precisely, fix missing values and broken types, merge multiple tables, summarise with groupby and pivot tables, and reshape data — including time series — into a clean, analysis-ready form.

Surveys say analysts spend up to 80% of their time cleaning and shaping data. This module is where you earn your keep. Pandas gives you a precise, repeatable toolkit to take raw, broken data and turn it into a tidy table you can trust.

1Selecting, indexing & filtering

Before you clean, you must select. Pandas gives you several precise ways to grab exactly the rows and columns you want.

import pandas as pd
df = pd.read_csv('orders.csv')

# one column (a Series) and several columns (a DataFrame)
amounts = df['amount']
subset  = df[['date', 'region', 'amount']]

# rows by LABEL with .loc, by POSITION with .iloc
first_row   = df.loc[0]
first_three = df.iloc[0:3]

# the workhorse: boolean filtering
big_north = df[(df['region'] == 'North') & (df['amount'] > 1000)]
print(big_north)

▶ Output

         date region  amount
0  2024-01-01  North    1200
2  2024-01-02  North    1450

You want…	Use
rows/cols by name	`df.loc[rows, cols]`
rows/cols by number	`df.iloc[rows, cols]`
rows matching a condition	`df[df['x'] > 5]`
a readable filter	`df.query('x > 5 and region == "North"')`

Combine conditions with & and | (not and/or), and wrap each condition in parentheses — forgetting the brackets is the #1 filtering error.

Key points

df['col'] is a Series; df[['a','b']] is a DataFrame.
.loc selects by label, .iloc by integer position.
Boolean filtering df[(cond1) & (cond2)] — use &/| with parentheses.

2Handling missing values

Real datasets are full of gaps. Step one is always to measure the missingness; step two is to decide — drop, fill, or flag — with a reason.

import pandas as pd
df = pd.read_csv('customers.csv')

# how many missing per column?
print(df.isna().sum())

▶ Output

name        0
age        14
city        3
spend      27
dtype: int64

Three honest strategies

# 1) DROP rows missing a critical field
df = df.dropna(subset=['name'])

# 2) FILL numeric gaps with a sensible statistic (median resists outliers)
df['age'] = df['age'].fillna(df['age'].median())

# 3) FILL categorical gaps with a label, not a guess
df['city'] = df['city'].fillna('Unknown')

print(df.isna().sum().sum(), 'missing values remaining')

▶ Output

0 missing values remaining

Do not blindly drop rows. Dropping can bias your results (e.g. if older customers skip the age field, dropping them skews your averages young). Always ask why data is missing before deleting it.

Key points

Start with df.isna().sum() to quantify missingness per column.
dropna(subset=...) removes rows; fillna() imputes — choose with a reason.
Median fills resist outliers; fill categories with an explicit label like 'Unknown'.

3Fixing data types & cleaning text

A number stored as text will not add up; a date stored as text will not sort. Fixing types is core wrangling.

import pandas as pd
df = pd.read_csv('sales_raw.csv')

# text -> number; bad values become NaN instead of crashing
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# text -> real dates
df['date'] = pd.to_datetime(df['date'])

# clean strings with the .str accessor
df['region'] = df['region'].str.strip().str.title()
df['email']  = df['email'].str.lower()

print(df.dtypes)

▶ Output

date      datetime64[ns]
region            object
amount           float64
email             object
dtype: object

The .str accessor applies a string method to a whole column at once: .str.strip(), .str.lower(), .str.replace('₹',''), .str.contains('north'). No loop needed.

# category type saves memory and speeds up grouping on repeated values
df['region'] = df['region'].astype('category')
print(df['region'].cat.categories)

▶ Output

Index(['East', 'North', 'South', 'West'], dtype='object')

Key points

pd.to_numeric(col, errors='coerce') turns bad numbers into NaN instead of crashing.
pd.to_datetime() makes dates sortable and unlocks time-series tools.
The .str accessor cleans whole text columns; astype('category') speeds up repeated labels.

4Combining tables: merge, join & concat

Insight usually needs more than one table — orders + customers, sales + targets. merge joins on a shared key, just like a SQL join.

import pandas as pd
orders    = pd.read_csv('orders.csv')      # has customer_id
customers = pd.read_csv('customers.csv')   # has customer_id, name, city

# keep every order; attach customer details where they match
enriched = orders.merge(customers, on='customer_id', how='left')
print(enriched[['order_id', 'name', 'city', 'amount']].head(3))

▶ Output

   order_id   name     city  amount
0      1001   Asha   Mumbai    1200
1      1002  Rohan    Delhi     980
2      1003   Maya     Pune    1450

An INNER join keeps only matches; a LEFT join keeps every row from the left table.

Use pd.concat([df1, df2]) to stack tables with the same columns (e.g. January + February files) on top of each other.

Key points

df.merge(other, on='key', how=...) joins tables on a shared column.
how='inner' keeps only matches; how='left' keeps every left-table row.
pd.concat([...]) stacks tables that share the same columns.

5GroupBy, aggregation & pivot tables

This is the heart of analytics: split the data into groups, apply a calculation, and combine the results. One line answers questions like “total sales per region”.

import pandas as pd
df = pd.read_csv('sales.csv')

# total and average sales per region
summary = df.groupby('region')['amount'].agg(['sum', 'mean', 'count'])
print(summary)

region	sum	mean	count
East	4100	1366.7	3
North	5300	1325.0	4
South	2940	980.0	3
West	3800	1266.7	3

Pivot tables — like Excel, but reproducible

# sales by region (rows) across quarters (columns)
pivot = df.pivot_table(index='region', columns='quarter',
                       values='amount', aggfunc='sum', fill_value=0)
print(pivot)

region	Q1	Q2	Q3	Q4
North	1200	1450	1300	1350
South	980	1020	940	0

Multiple metrics at once: pass a dict to agg, e.g. df.groupby('region').agg(revenue=('amount','sum'), orders=('order_id','count')) for named output columns.

Key points

groupby(col)[metric].agg([...]) = split → apply → combine, the core of analysis.
pivot_table() reshapes group results into a spreadsheet-style cross-tab.
Use named aggregations (name=('col','func')) for clear, report-ready column names.

6Reshaping & time-series wrangling

The last skill is changing a table's shape — wide↔long — and working with dates over time.

Wide ↔ long with melt

# wide: one column per month  ->  long: a 'month' column + a 'sales' column
long = wide.melt(id_vars='region', var_name='month', value_name='sales')
print(long.head(3))

▶ Output

  region month  sales
0  North   Jan   1200
1  South   Jan    980
2  North   Feb   1450

Time series: resample & rolling

df['date'] = pd.to_datetime(df['date'])
ts = df.set_index('date')

monthly = ts['amount'].resample('M').sum()        # roll daily up to monthly
trend   = monthly.rolling(window=3).mean()        # 3-month moving average
print(monthly.tail(3))

▶ Output

date
2024-04-30    38200
2024-05-31    41050
2024-06-30    44300
Freq: M, Name: amount, dtype: int64

Think in columns, not loops. Nearly every wrangling task has a vectorised Pandas method that runs on the whole column at once — far faster and clearer than a Python for loop. Reach for the method before writing a loop.

Key points

melt() turns wide data into long (tidy) data; pivot() does the reverse.
Set a datetime index, then resample('M') to change frequency and rolling() for moving averages.
Prefer vectorised column operations over for loops — faster and more readable.

★ Hands-on Project — E-Commerce Data Cleanup

You receive a deliberately messy e-commerce export. Turn it into one clean, analysis-ready master table — the exact task analysts do every week.

Load orders_raw.csv and inspect it with df.head(), df.info() and df.isna().sum().
Fix types: convert amount with pd.to_numeric(errors='coerce') and order_date with pd.to_datetime.
Clean text columns with the .str accessor (strip spaces, fix inconsistent casing like 'north'/'North').
Handle missing values deliberately: drop rows with no order_id, fill numeric gaps with the median, label missing categories 'Unknown'.
Remove exact duplicate rows with df.drop_duplicates().
Merge in customers.csv on customer_id with a left join to add customer name and city.
Build a summary with groupby or pivot_table: total revenue and order count per region.
Save the cleaned master table to orders_clean.csv and commit the notebook/script to GitHub.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.

← Previous

Data Collection & Ingestion

Exploratory Data Analysis (EDA)