Surveys say analysts spend up to 80% of their time cleaning and shaping data. This module is where you earn your keep. Pandas gives you a precise, repeatable toolkit to take raw, broken data and turn it into a tidy table you can trust.
1Selecting, indexing & filtering
Before you clean, you must select. Pandas gives you several precise ways to grab exactly the rows and columns you want.
import pandas as pd
df = pd.read_csv('orders.csv')
# one column (a Series) and several columns (a DataFrame)
amounts = df['amount']
subset = df[['date', 'region', 'amount']]
# rows by LABEL with .loc, by POSITION with .iloc
first_row = df.loc[0]
first_three = df.iloc[0:3]
# the workhorse: boolean filtering
big_north = df[(df['region'] == 'North') & (df['amount'] > 1000)]
print(big_north)date region amount 0 2024-01-01 North 1200 2 2024-01-02 North 1450
| You want… | Use |
|---|---|
| rows/cols by name | df.loc[rows, cols] |
| rows/cols by number | df.iloc[rows, cols] |
| rows matching a condition | df[df['x'] > 5] |
| a readable filter | df.query('x > 5 and region == "North"') |
& and | (not and/or), and wrap each condition in parentheses — forgetting the brackets is the #1 filtering error.df['col']is a Series;df[['a','b']]is a DataFrame..locselects by label,.ilocby integer position.- Boolean filtering
df[(cond1) & (cond2)]— use&/|with parentheses.
2Handling missing values
Real datasets are full of gaps. Step one is always to measure the missingness; step two is to decide — drop, fill, or flag — with a reason.
import pandas as pd
df = pd.read_csv('customers.csv')
# how many missing per column?
print(df.isna().sum())name 0 age 14 city 3 spend 27 dtype: int64
Three honest strategies
# 1) DROP rows missing a critical field
df = df.dropna(subset=['name'])
# 2) FILL numeric gaps with a sensible statistic (median resists outliers)
df['age'] = df['age'].fillna(df['age'].median())
# 3) FILL categorical gaps with a label, not a guess
df['city'] = df['city'].fillna('Unknown')
print(df.isna().sum().sum(), 'missing values remaining')0 missing values remaining
- Start with
df.isna().sum()to quantify missingness per column. dropna(subset=...)removes rows;fillna()imputes — choose with a reason.- Median fills resist outliers; fill categories with an explicit label like
'Unknown'.
3Fixing data types & cleaning text
A number stored as text will not add up; a date stored as text will not sort. Fixing types is core wrangling.
import pandas as pd
df = pd.read_csv('sales_raw.csv')
# text -> number; bad values become NaN instead of crashing
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# text -> real dates
df['date'] = pd.to_datetime(df['date'])
# clean strings with the .str accessor
df['region'] = df['region'].str.strip().str.title()
df['email'] = df['email'].str.lower()
print(df.dtypes)date datetime64[ns] region object amount float64 email object dtype: object
.str accessor applies a string method to a whole column at once: .str.strip(), .str.lower(), .str.replace('₹',''), .str.contains('north'). No loop needed.# category type saves memory and speeds up grouping on repeated values
df['region'] = df['region'].astype('category')
print(df['region'].cat.categories)Index(['East', 'North', 'South', 'West'], dtype='object')
pd.to_numeric(col, errors='coerce')turns bad numbers into NaN instead of crashing.pd.to_datetime()makes dates sortable and unlocks time-series tools.- The
.straccessor cleans whole text columns;astype('category')speeds up repeated labels.
4Combining tables: merge, join & concat
Insight usually needs more than one table — orders + customers, sales + targets. merge joins on a shared key, just like a SQL join.
import pandas as pd
orders = pd.read_csv('orders.csv') # has customer_id
customers = pd.read_csv('customers.csv') # has customer_id, name, city
# keep every order; attach customer details where they match
enriched = orders.merge(customers, on='customer_id', how='left')
print(enriched[['order_id', 'name', 'city', 'amount']].head(3))order_id name city amount 0 1001 Asha Mumbai 1200 1 1002 Rohan Delhi 980 2 1003 Maya Pune 1450
Use pd.concat([df1, df2]) to stack tables with the same columns (e.g. January + February files) on top of each other.
df.merge(other, on='key', how=...)joins tables on a shared column.how='inner'keeps only matches;how='left'keeps every left-table row.pd.concat([...])stacks tables that share the same columns.
5GroupBy, aggregation & pivot tables
This is the heart of analytics: split the data into groups, apply a calculation, and combine the results. One line answers questions like “total sales per region”.
import pandas as pd
df = pd.read_csv('sales.csv')
# total and average sales per region
summary = df.groupby('region')['amount'].agg(['sum', 'mean', 'count'])
print(summary)| region | sum | mean | count |
|---|---|---|---|
| East | 4100 | 1366.7 | 3 |
| North | 5300 | 1325.0 | 4 |
| South | 2940 | 980.0 | 3 |
| West | 3800 | 1266.7 | 3 |
Pivot tables — like Excel, but reproducible
# sales by region (rows) across quarters (columns)
pivot = df.pivot_table(index='region', columns='quarter',
values='amount', aggfunc='sum', fill_value=0)
print(pivot)| region | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| North | 1200 | 1450 | 1300 | 1350 |
| South | 980 | 1020 | 940 | 0 |
agg, e.g. df.groupby('region').agg(revenue=('amount','sum'), orders=('order_id','count')) for named output columns.groupby(col)[metric].agg([...])= split → apply → combine, the core of analysis.pivot_table()reshapes group results into a spreadsheet-style cross-tab.- Use named aggregations (
name=('col','func')) for clear, report-ready column names.
6Reshaping & time-series wrangling
The last skill is changing a table's shape — wide↔long — and working with dates over time.
Wide ↔ long with melt
# wide: one column per month -> long: a 'month' column + a 'sales' column
long = wide.melt(id_vars='region', var_name='month', value_name='sales')
print(long.head(3))region month sales 0 North Jan 1200 1 South Jan 980 2 North Feb 1450
Time series: resample & rolling
df['date'] = pd.to_datetime(df['date'])
ts = df.set_index('date')
monthly = ts['amount'].resample('M').sum() # roll daily up to monthly
trend = monthly.rolling(window=3).mean() # 3-month moving average
print(monthly.tail(3))date 2024-04-30 38200 2024-05-31 41050 2024-06-30 44300 Freq: M, Name: amount, dtype: int64
for loop. Reach for the method before writing a loop.melt()turns wide data into long (tidy) data;pivot()does the reverse.- Set a datetime index, then
resample('M')to change frequency androlling()for moving averages. - Prefer vectorised column operations over
forloops — faster and more readable.
★ Hands-on Project — E-Commerce Data Cleanup
You receive a deliberately messy e-commerce export. Turn it into one clean, analysis-ready master table — the exact task analysts do every week.
- Load
orders_raw.csvand inspect it withdf.head(),df.info()anddf.isna().sum(). - Fix types: convert
amountwithpd.to_numeric(errors='coerce')andorder_datewithpd.to_datetime. - Clean text columns with the
.straccessor (strip spaces, fix inconsistent casing like 'north'/'North'). - Handle missing values deliberately: drop rows with no
order_id, fill numeric gaps with the median, label missing categories 'Unknown'. - Remove exact duplicate rows with
df.drop_duplicates(). - Merge in
customers.csvoncustomer_idwith a left join to add customer name and city. - Build a summary with
groupbyorpivot_table: total revenue and order count per region. - Save the cleaned master table to
orders_clean.csvand commit the notebook/script to GitHub.
Ready to test yourself?
Take the module quiz. Score 70% or more to mark this module complete.
Start the quiz →💡 Log in to save your progress and earn the certificate.