🧹 Module 3

Data Wrangling with Pandas

⏱ 14 hoursIntermediate6 topics
🎯 By the end: select and filter data precisely, fix missing values and broken types, merge multiple tables, summarise with groupby and pivot tables, and reshape data — including time series — into a clean, analysis-ready form.

Surveys say analysts spend up to 80% of their time cleaning and shaping data. This module is where you earn your keep. Pandas gives you a precise, repeatable toolkit to take raw, broken data and turn it into a tidy table you can trust.

1Selecting, indexing & filtering

Before you clean, you must select. Pandas gives you several precise ways to grab exactly the rows and columns you want.

import pandas as pd
df = pd.read_csv('orders.csv')

# one column (a Series) and several columns (a DataFrame)
amounts = df['amount']
subset  = df[['date', 'region', 'amount']]

# rows by LABEL with .loc, by POSITION with .iloc
first_row   = df.loc[0]
first_three = df.iloc[0:3]

# the workhorse: boolean filtering
big_north = df[(df['region'] == 'North') & (df['amount'] > 1000)]
print(big_north)
▶ Output
         date region  amount
0  2024-01-01  North    1200
2  2024-01-02  North    1450
You want…Use
rows/cols by namedf.loc[rows, cols]
rows/cols by numberdf.iloc[rows, cols]
rows matching a conditiondf[df['x'] > 5]
a readable filterdf.query('x > 5 and region == "North"')
Combine conditions with & and | (not and/or), and wrap each condition in parentheses — forgetting the brackets is the #1 filtering error.
Key points
  • df['col'] is a Series; df[['a','b']] is a DataFrame.
  • .loc selects by label, .iloc by integer position.
  • Boolean filtering df[(cond1) & (cond2)] — use &/| with parentheses.

2Handling missing values

Real datasets are full of gaps. Step one is always to measure the missingness; step two is to decide — drop, fill, or flag — with a reason.

import pandas as pd
df = pd.read_csv('customers.csv')

# how many missing per column?
print(df.isna().sum())
▶ Output
name        0
age        14
city        3
spend      27
dtype: int64

Three honest strategies

# 1) DROP rows missing a critical field
df = df.dropna(subset=['name'])

# 2) FILL numeric gaps with a sensible statistic (median resists outliers)
df['age'] = df['age'].fillna(df['age'].median())

# 3) FILL categorical gaps with a label, not a guess
df['city'] = df['city'].fillna('Unknown')

print(df.isna().sum().sum(), 'missing values remaining')
▶ Output
0 missing values remaining
Do not blindly drop rows. Dropping can bias your results (e.g. if older customers skip the age field, dropping them skews your averages young). Always ask why data is missing before deleting it.
Key points
  • Start with df.isna().sum() to quantify missingness per column.
  • dropna(subset=...) removes rows; fillna() imputes — choose with a reason.
  • Median fills resist outliers; fill categories with an explicit label like 'Unknown'.

3Fixing data types & cleaning text

A number stored as text will not add up; a date stored as text will not sort. Fixing types is core wrangling.

import pandas as pd
df = pd.read_csv('sales_raw.csv')

# text -> number; bad values become NaN instead of crashing
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# text -> real dates
df['date'] = pd.to_datetime(df['date'])

# clean strings with the .str accessor
df['region'] = df['region'].str.strip().str.title()
df['email']  = df['email'].str.lower()

print(df.dtypes)
▶ Output
date      datetime64[ns]
region            object
amount           float64
email             object
dtype: object
The .str accessor applies a string method to a whole column at once: .str.strip(), .str.lower(), .str.replace('₹',''), .str.contains('north'). No loop needed.
# category type saves memory and speeds up grouping on repeated values
df['region'] = df['region'].astype('category')
print(df['region'].cat.categories)
▶ Output
Index(['East', 'North', 'South', 'West'], dtype='object')
Key points
  • pd.to_numeric(col, errors='coerce') turns bad numbers into NaN instead of crashing.
  • pd.to_datetime() makes dates sortable and unlocks time-series tools.
  • The .str accessor cleans whole text columns; astype('category') speeds up repeated labels.

4Combining tables: merge, join & concat

Insight usually needs more than one table — orders + customers, sales + targets. merge joins on a shared key, just like a SQL join.

import pandas as pd
orders    = pd.read_csv('orders.csv')      # has customer_id
customers = pd.read_csv('customers.csv')   # has customer_id, name, city

# keep every order; attach customer details where they match
enriched = orders.merge(customers, on='customer_id', how='left')
print(enriched[['order_id', 'name', 'city', 'amount']].head(3))
▶ Output
   order_id   name     city  amount
0      1001   Asha   Mumbai    1200
1      1002  Rohan    Delhi     980
2      1003   Maya     Pune    1450
INNER joinonly matching keysLEFT joinall left rows + matches
An INNER join keeps only matches; a LEFT join keeps every row from the left table.

Use pd.concat([df1, df2]) to stack tables with the same columns (e.g. January + February files) on top of each other.

Key points
  • df.merge(other, on='key', how=...) joins tables on a shared column.
  • how='inner' keeps only matches; how='left' keeps every left-table row.
  • pd.concat([...]) stacks tables that share the same columns.

5GroupBy, aggregation & pivot tables

This is the heart of analytics: split the data into groups, apply a calculation, and combine the results. One line answers questions like “total sales per region”.

import pandas as pd
df = pd.read_csv('sales.csv')

# total and average sales per region
summary = df.groupby('region')['amount'].agg(['sum', 'mean', 'count'])
print(summary)
regionsummeancount
East41001366.73
North53001325.04
South2940980.03
West38001266.73

Pivot tables — like Excel, but reproducible

# sales by region (rows) across quarters (columns)
pivot = df.pivot_table(index='region', columns='quarter',
                       values='amount', aggfunc='sum', fill_value=0)
print(pivot)
regionQ1Q2Q3Q4
North1200145013001350
South98010209400
Multiple metrics at once: pass a dict to agg, e.g. df.groupby('region').agg(revenue=('amount','sum'), orders=('order_id','count')) for named output columns.
Key points
  • groupby(col)[metric].agg([...]) = split → apply → combine, the core of analysis.
  • pivot_table() reshapes group results into a spreadsheet-style cross-tab.
  • Use named aggregations (name=('col','func')) for clear, report-ready column names.

6Reshaping & time-series wrangling

The last skill is changing a table's shape — wide↔long — and working with dates over time.

Wide ↔ long with melt

# wide: one column per month  ->  long: a 'month' column + a 'sales' column
long = wide.melt(id_vars='region', var_name='month', value_name='sales')
print(long.head(3))
▶ Output
  region month  sales
0  North   Jan   1200
1  South   Jan    980
2  North   Feb   1450

Time series: resample & rolling

df['date'] = pd.to_datetime(df['date'])
ts = df.set_index('date')

monthly = ts['amount'].resample('M').sum()        # roll daily up to monthly
trend   = monthly.rolling(window=3).mean()        # 3-month moving average
print(monthly.tail(3))
▶ Output
date
2024-04-30    38200
2024-05-31    41050
2024-06-30    44300
Freq: M, Name: amount, dtype: int64
Think in columns, not loops. Nearly every wrangling task has a vectorised Pandas method that runs on the whole column at once — far faster and clearer than a Python for loop. Reach for the method before writing a loop.
Key points
  • melt() turns wide data into long (tidy) data; pivot() does the reverse.
  • Set a datetime index, then resample('M') to change frequency and rolling() for moving averages.
  • Prefer vectorised column operations over for loops — faster and more readable.

★ Hands-on Project — E-Commerce Data Cleanup

You receive a deliberately messy e-commerce export. Turn it into one clean, analysis-ready master table — the exact task analysts do every week.

  1. Load orders_raw.csv and inspect it with df.head(), df.info() and df.isna().sum().
  2. Fix types: convert amount with pd.to_numeric(errors='coerce') and order_date with pd.to_datetime.
  3. Clean text columns with the .str accessor (strip spaces, fix inconsistent casing like 'north'/'North').
  4. Handle missing values deliberately: drop rows with no order_id, fill numeric gaps with the median, label missing categories 'Unknown'.
  5. Remove exact duplicate rows with df.drop_duplicates().
  6. Merge in customers.csv on customer_id with a left join to add customer name and city.
  7. Build a summary with groupby or pivot_table: total revenue and order count per region.
  8. Save the cleaned master table to orders_clean.csv and commit the notebook/script to GitHub.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.