Data scientists famously spend most of their time not modelling, but wrangling — getting data from somewhere, then beating it into a clean, tidy shape. The tool for that job is pandas, the most important library in applied data science. If Module 1 gave you the engine (NumPy), pandas is the whole vehicle: labelled rows and columns, missing-value handling, joins, grouping and reshaping, all vectorised. Master this module and the rest of the course becomes far easier, because every model starts from a clean DataFrame you built.
1From NumPy to pandas: Series & DataFrame
pandas adds labels to NumPy's fast arrays. A Series is one labelled column; a DataFrame is a whole table — rows and named columns — exactly like a spreadsheet or a database table, but programmable.
Build a DataFrame from scratch
import pandas as pd
df = pd.DataFrame({
'name': ['Asha', 'Ravi', 'Meera', 'John'],
'age': [30, 25, 35, 28],
'city': ['Mumbai', 'Delhi', 'Pune', 'Delhi'],
})
print(df)
print('Shape:', df.shape)
print('Dtypes:\n', df.dtypes)name age city 0 Asha 30 Mumbai 1 Ravi 25 Delhi 2 Meera 35 Pune 3 John 28 Delhi Shape: (4, 3) Dtypes: name object age int64 city object dtype: object
object usually means text. Numeric columns become int64 or float64. Checking df.dtypes early saves hours — a number stored as text will not add up.- A Series is one labelled column; a DataFrame is a table of aligned Series sharing an index.
df.shapegives (rows, columns);df.dtypesshows each column's type.objectdtype usually means text — verify types before doing maths.
2Reading data from anywhere: CSV, Excel, JSON, SQL, APIs
Real projects pull data from many places. pandas has a read_* function for each, and they all return a DataFrame so the rest of your code is identical.
| Source | One-liner |
|---|---|
| CSV file | pd.read_csv('data.csv') |
| Excel sheet | pd.read_excel('book.xlsx', sheet_name='Sales') |
| JSON / API | pd.read_json('records.json') |
| SQL database | pd.read_sql('SELECT * FROM sales', conn) |
| HTML table on a page | pd.read_html('https://...')[0] |
Read a CSV and look at it
import pandas as pd
sales = pd.read_csv('sales.csv') # parse_dates=['date'] for real dates
print(sales.head(3)) # first 3 rows
print('Rows, cols:', sales.shape)| date | region | product | amount | |
|---|---|---|---|---|
| 0 | 2024-01-01 | North | Widget | 1200 |
| 1 | 2024-01-01 | South | Gadget | 980 |
| 2 | 2024-01-02 | North | Widget | 1450 |
Pull live data from an API
Most public data APIs return JSON. Fetch it, then hand it to pandas.
import requests
import pandas as pd
url = 'https://api.example.com/v1/exchange-rates'
resp = requests.get(url, timeout=10)
resp.raise_for_status() # fail loudly on a bad response
data = resp.json() # JSON -> Python list/dict
rates = pd.DataFrame(data['rates'])
print(rates.head())timeout and call raise_for_status(). A silent network failure that returns half a page is how corrupt data sneaks into a pipeline. Fail early and loudly.- Every
pd.read_*function returns a DataFrame, so downstream code stays the same. - Use
parse_dates=on read so date columns become real datetimes, not text. - For APIs: set a timeout, call
raise_for_status(), then build a DataFrame from the JSON.
3Inspecting & selecting: head, info, loc, iloc & filtering
Before you change anything, look. Then learn the two ways to select: by label (loc) and by position (iloc). Filtering with Boolean masks — the idea from Module 1 — is how you ask questions of your data.
Get the lay of the land
sales.info() # columns, non-null counts, dtypes, memory
sales.describe() # count, mean, std, min, quartiles, max
sales['region'].value_counts() # how many rows per categoryregion North 312 South 288 East 205 West 195 Name: count, dtype: int64
Select columns, rows & cells
# one column (a Series) and several columns (a DataFrame)
amounts = sales['amount']
subset = sales[['region', 'amount']]
# label-based vs position-based
first_label = sales.loc[0, 'amount'] # by index label + column name
first_pos = sales.iloc[0, 3] # by row 0, column 3
print(first_label, first_pos)1200 1200
Filter with Boolean masks
# Big sales in the North region
mask = (sales['amount'] > 1000) & (sales['region'] == 'North')
big_north = sales[mask]
print('Matching rows:', len(big_north))
print('Their mean amount:', round(big_north['amount'].mean(), 2))Matching rows: 142 Their mean amount: 1487.36
&, |, ~ — and wrap each in parentheses. Python's and/or do not work element-wise on Series, and missing parentheses cause confusing errors.- Start with
info(),describe()andvalue_counts()to understand a new dataset. locselects by label,ilocby integer position — keep them straight.- Filter with Boolean masks; combine conditions with
& | ~and parenthesise each one.
4Cleaning: missing values, duplicates, types & text
Real data is dirty. The four recurring problems are missing values, duplicates, wrong types and messy text. Cleaning them is the difference between a trustworthy result and a confident lie.
Find and handle missing values
print(sales.isna().sum()) # missing count per column
# Strategy 1: drop rows missing a critical field
sales = sales.dropna(subset=['amount'])
# Strategy 2: fill with a sensible default
sales['region'] = sales['region'].fillna('Unknown')
# Strategy 3: fill numbers with the column median (robust to outliers)
median_amt = sales['amount'].median()
sales['amount'] = sales['amount'].fillna(median_amt)date 0 region 4 product 0 amount 11 dtype: int64
Remove duplicates & fix types
before = len(sales)
sales = sales.drop_duplicates()
print('Removed', before - len(sales), 'duplicate rows')
# Convert a text column that should be numeric
sales['amount'] = pd.to_numeric(sales['amount'], errors='coerce')
# Parse a date column properly
sales['date'] = pd.to_datetime(sales['date'])Removed 7 duplicate rows
Clean messy text with .str
# " North ", "north", "NORTH" should all be "North"
sales['region'] = (sales['region']
.str.strip() # trim spaces
.str.title()) # Title Case
print(sales['region'].unique())['North' 'South' 'East' 'West' 'Unknown']
isna().sum()reveals missingness; choose to drop, fill, or impute deliberately.drop_duplicates()removes repeats;to_numeric/to_datetimefix bad types (useerrors='coerce').- The
.straccessor (strip,lower,title,replace) standardises messy text columns.
5Combining & reshaping: merge, concat, groupby, pivot
Insight usually lives across tables and in the right shape. Four moves cover almost everything: merge (join on a key), concat (stack), groupby (split-apply-combine) and pivot/melt (reshape wide↔long).
Merge two tables on a key
# sales has a 'product' column; products has price & category
merged = sales.merge(products, on='product', how='left')
print(merged[['product', 'amount', 'category']].head(3))Group, aggregate & answer business questions
# Total and average sales per region
summary = (sales
.groupby('region')['amount']
.agg(['sum', 'mean', 'count'])
.round(2)
.sort_values('sum', ascending=False))
print(summary)| region | sum | mean | count |
|---|---|---|---|
| North | 464160 | 1487.69 | 312 |
| South | 361440 | 1255.00 | 288 |
| East | 240875 | 1175.00 | 205 |
| West | 222300 | 1140.00 | 195 |
Reshape with pivot_table
# Region down the side, product across the top, summing amount
wide = sales.pivot_table(index='region', columns='product',
values='amount', aggfunc='sum', fill_value=0)
print(wide)merge(..., on=key, how='left')joins tables;concatstacks them.groupby(col)[target].agg([...])is split-apply-combine — the core of analysis.pivot_tablereshapes long data into a wide summary;meltdoes the reverse.
6Feature engineering & saving tidy data
The final wrangling step turns clean data into features a model can learn from, then saves the result so the expensive cleaning never has to run twice.
Create new columns
# Derive useful features from existing columns
sales['month'] = sales['date'].dt.month_name()
sales['is_weekend'] = sales['date'].dt.dayofweek >= 5
sales['amount_inr_k'] = (sales['amount'] / 1000).round(2)
# Bucket a numeric column into labelled bands
sales['size'] = pd.cut(sales['amount'],
bins=[0, 500, 1500, float('inf')],
labels=['Small', 'Medium', 'Large'])
print(sales[['amount', 'size', 'month', 'is_weekend']].head(3))| amount | size | month | is_weekend | |
|---|---|---|---|---|
| 0 | 1200 | Medium | January | False |
| 1 | 980 | Medium | January | False |
| 2 | 1450 | Medium | January | True |
Encode categories for modelling
# One-hot encode a categorical column into 0/1 columns
encoded = pd.get_dummies(sales, columns=['region'], prefix='reg')
print([c for c in encoded.columns if c.startswith('reg_')])['reg_East', 'reg_North', 'reg_South', 'reg_West']
Save the clean dataset
# CSV is universal; Parquet is smaller, faster and keeps dtypes
sales.to_csv('sales_clean.csv', index=False)
sales.to_parquet('sales_clean.parquet')
print('Saved clean dataset:', sales.shape)Saved clean dataset: (1000, 9)
- Engineer features with
.dtaccessors, arithmetic, andpd.cutfor banding. pd.get_dummiesone-hot encodes categories into numeric columns models can use.- Save cleaned data once — Parquet preserves dtypes and is far smaller/faster than CSV.
★ Hands-on Project — Build a Clean Sales Dataset
Run a full mini-wrangling pipeline on a realistic, messy dataset and produce a tidy, model-ready table plus a short data-quality report.
- Create or download a CSV of at least 500 sales rows with columns date, region, product, amount — deliberately include missing values, duplicates, inconsistent text casing, and a few amounts stored as text.
- Load it with
pd.read_csv(..., parse_dates=['date'])and inspect withinfo(),describe()andisna().sum(). - Clean it: drop or impute missing values (justify your choice), remove duplicates, coerce
amountto numeric, and standardiseregiontext with.str. - Create a second small
products.csv(product, category, unit_price) andmergeit onproductwith a left join. - Engineer features: a
monthcolumn, anis_weekendflag, and asizeband withpd.cut. - Use
groupbyto produce a region-level summary (sum, mean, count) and apivot_tableof region × product totals. - Write a short markdown data-quality note: how many rows you started/ended with, what you fixed, and any assumptions.
- Save the cleaned dataset to both CSV and Parquet, then commit the notebook and outputs to your GitHub portfolio repo.
Ready to test yourself?
Take the module quiz. Score 70% or more to mark this module complete.
Start the quiz →💡 Log in to save your progress and earn the certificate.