🧹 Module 2

Data Acquisition & Wrangling with pandas

⏱ 16 hoursBeginner–Intermediate6 topics

🎯 By the end: load data from any common source into a pandas DataFrame, select and filter it fluently, fix missing values, duplicates and bad types, and merge, reshape and aggregate it into the tidy tables every model needs.

Data scientists famously spend most of their time not modelling, but wrangling — getting data from somewhere, then beating it into a clean, tidy shape. The tool for that job is pandas, the most important library in applied data science. If Module 1 gave you the engine (NumPy), pandas is the whole vehicle: labelled rows and columns, missing-value handling, joins, grouping and reshaping, all vectorised. Master this module and the rest of the course becomes far easier, because every model starts from a clean DataFrame you built.

1From NumPy to pandas: Series & DataFrame

pandas adds labels to NumPy's fast arrays. A Series is one labelled column; a DataFrame is a whole table — rows and named columns — exactly like a spreadsheet or a database table, but programmable.

A DataFrame = a shared index + one Series per column. Each column has its own type.

Build a DataFrame from scratch

import pandas as pd

df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'John'],
    'age':  [30, 25, 35, 28],
    'city': ['Mumbai', 'Delhi', 'Pune', 'Delhi'],
})

print(df)
print('Shape:', df.shape)
print('Dtypes:\n', df.dtypes)

▶ Output

    name  age    city
0   Asha   30  Mumbai
1   Ravi   25   Delhi
2  Meera   35    Pune
3   John   28   Delhi
Shape: (4, 3)
Dtypes:
 name    object
age      int64
city    object
dtype: object

dtype object usually means text. Numeric columns become int64 or float64. Checking df.dtypes early saves hours — a number stored as text will not add up.

Key points

A Series is one labelled column; a DataFrame is a table of aligned Series sharing an index.
df.shape gives (rows, columns); df.dtypes shows each column's type.
object dtype usually means text — verify types before doing maths.

2Reading data from anywhere: CSV, Excel, JSON, SQL, APIs

Real projects pull data from many places. pandas has a read_* function for each, and they all return a DataFrame so the rest of your code is identical.

Source	One-liner
CSV file	`pd.read_csv('data.csv')`
Excel sheet	`pd.read_excel('book.xlsx', sheet_name='Sales')`
JSON / API	`pd.read_json('records.json')`
SQL database	`pd.read_sql('SELECT * FROM sales', conn)`
HTML table on a page	`pd.read_html('https://...')[0]`

Read a CSV and look at it

import pandas as pd

sales = pd.read_csv('sales.csv')      # parse_dates=['date'] for real dates
print(sales.head(3))                   # first 3 rows
print('Rows, cols:', sales.shape)

sales.head(3)
	date	region	product	amount
0	2024-01-01	North	Widget	1200
1	2024-01-01	South	Gadget	980
2	2024-01-02	North	Widget	1450

Pull live data from an API

Most public data APIs return JSON. Fetch it, then hand it to pandas.

import requests
import pandas as pd

url = 'https://api.example.com/v1/exchange-rates'
resp = requests.get(url, timeout=10)
resp.raise_for_status()               # fail loudly on a bad response

data = resp.json()                    # JSON  ->  Python list/dict
rates = pd.DataFrame(data['rates'])
print(rates.head())

Always set a timeout and call raise_for_status(). A silent network failure that returns half a page is how corrupt data sneaks into a pipeline. Fail early and loudly.

Key points

Every pd.read_* function returns a DataFrame, so downstream code stays the same.
Use parse_dates= on read so date columns become real datetimes, not text.
For APIs: set a timeout, call raise_for_status(), then build a DataFrame from the JSON.

3Inspecting & selecting: head, info, loc, iloc & filtering

Before you change anything, look. Then learn the two ways to select: by label (loc) and by position (iloc). Filtering with Boolean masks — the idea from Module 1 — is how you ask questions of your data.

Get the lay of the land

sales.info()              # columns, non-null counts, dtypes, memory
sales.describe()          # count, mean, std, min, quartiles, max
sales['region'].value_counts()   # how many rows per category

▶ Output (value_counts)

region
North    312
South    288
East     205
West     195
Name: count, dtype: int64

Select columns, rows & cells

# one column (a Series) and several columns (a DataFrame)
amounts = sales['amount']
subset  = sales[['region', 'amount']]

# label-based vs position-based
first_label = sales.loc[0, 'amount']     # by index label + column name
first_pos   = sales.iloc[0, 3]           # by row 0, column 3
print(first_label, first_pos)

▶ Output

1200 1200

Filter with Boolean masks

# Big sales in the North region
mask = (sales['amount'] > 1000) & (sales['region'] == 'North')
big_north = sales[mask]

print('Matching rows:', len(big_north))
print('Their mean amount:', round(big_north['amount'].mean(), 2))

▶ Output

Matching rows: 142
Their mean amount: 1487.36

Combine conditions with &, |, ~ — and wrap each in parentheses. Python's and/or do not work element-wise on Series, and missing parentheses cause confusing errors.

Key points

Start with info(), describe() and value_counts() to understand a new dataset.
loc selects by label, iloc by integer position — keep them straight.
Filter with Boolean masks; combine conditions with & | ~ and parenthesise each one.

4Cleaning: missing values, duplicates, types & text

Real data is dirty. The four recurring problems are missing values, duplicates, wrong types and messy text. Cleaning them is the difference between a trustworthy result and a confident lie.

Find and handle missing values

print(sales.isna().sum())             # missing count per column

# Strategy 1: drop rows missing a critical field
sales = sales.dropna(subset=['amount'])

# Strategy 2: fill with a sensible default
sales['region'] = sales['region'].fillna('Unknown')

# Strategy 3: fill numbers with the column median (robust to outliers)
median_amt = sales['amount'].median()
sales['amount'] = sales['amount'].fillna(median_amt)

▶ Output (isna().sum())

date        0
region      4
product     0
amount     11
dtype: int64

Remove duplicates & fix types

before = len(sales)
sales = sales.drop_duplicates()
print('Removed', before - len(sales), 'duplicate rows')

# Convert a text column that should be numeric
sales['amount'] = pd.to_numeric(sales['amount'], errors='coerce')
# Parse a date column properly
sales['date'] = pd.to_datetime(sales['date'])

▶ Output

Removed 7 duplicate rows

Clean messy text with .str

# " North  ", "north", "NORTH" should all be "North"
sales['region'] = (sales['region']
                   .str.strip()        # trim spaces
                   .str.title())       # Title Case

print(sales['region'].unique())

▶ Output

['North' 'South' 'East' 'West' 'Unknown']

Choose imputation deliberately. Dropping rows loses data; filling with the mean is sensitive to outliers; the median is safer for skewed numbers. Always record what you did and why — it affects every result downstream.

Key points

isna().sum() reveals missingness; choose to drop, fill, or impute deliberately.
drop_duplicates() removes repeats; to_numeric/to_datetime fix bad types (use errors='coerce').
The .str accessor (strip, lower, title, replace) standardises messy text columns.

5Combining & reshaping: merge, concat, groupby, pivot

Insight usually lives across tables and in the right shape. Four moves cover almost everything: merge (join on a key), concat (stack), groupby (split-apply-combine) and pivot/melt (reshape wide↔long).

Merge two tables on a key

# sales has a 'product' column; products has price & category
merged = sales.merge(products, on='product', how='left')
print(merged[['product', 'amount', 'category']].head(3))

A left join keeps all rows from the left table and attaches matching columns from the right.

Group, aggregate & answer business questions

# Total and average sales per region
summary = (sales
           .groupby('region')['amount']
           .agg(['sum', 'mean', 'count'])
           .round(2)
           .sort_values('sum', ascending=False))
print(summary)

sales by region
region	sum	mean	count
North	464160	1487.69	312
South	361440	1255.00	288
East	240875	1175.00	205
West	222300	1140.00	195

Reshape with pivot_table

# Region down the side, product across the top, summing amount
wide = sales.pivot_table(index='region', columns='product',
                         values='amount', aggfunc='sum', fill_value=0)
print(wide)

groupby is the workhorse of analysis. “Split into groups, apply a calculation, combine the results” answers most business questions — revenue by month, churn by plan, score by class. You will reach for it daily.

Key points

merge(..., on=key, how='left') joins tables; concat stacks them.
groupby(col)[target].agg([...]) is split-apply-combine — the core of analysis.
pivot_table reshapes long data into a wide summary; melt does the reverse.

6Feature engineering & saving tidy data

The final wrangling step turns clean data into features a model can learn from, then saves the result so the expensive cleaning never has to run twice.

Create new columns

# Derive useful features from existing columns
sales['month'] = sales['date'].dt.month_name()
sales['is_weekend'] = sales['date'].dt.dayofweek >= 5
sales['amount_inr_k'] = (sales['amount'] / 1000).round(2)

# Bucket a numeric column into labelled bands
sales['size'] = pd.cut(sales['amount'],
                       bins=[0, 500, 1500, float('inf')],
                       labels=['Small', 'Medium', 'Large'])
print(sales[['amount', 'size', 'month', 'is_weekend']].head(3))

	amount	size	month	is_weekend
0	1200	Medium	January	False
1	980	Medium	January	False
2	1450	Medium	January	True

Encode categories for modelling

# One-hot encode a categorical column into 0/1 columns
encoded = pd.get_dummies(sales, columns=['region'], prefix='reg')
print([c for c in encoded.columns if c.startswith('reg_')])

▶ Output

['reg_East', 'reg_North', 'reg_South', 'reg_West']

Save the clean dataset

# CSV is universal; Parquet is smaller, faster and keeps dtypes
sales.to_csv('sales_clean.csv', index=False)
sales.to_parquet('sales_clean.parquet')
print('Saved clean dataset:', sales.shape)

▶ Output

Saved clean dataset: (1000, 9)

Prefer Parquet for intermediate data. It is columnar, compressed and remembers each column's dtype — so your cleaned dates and categories survive a round-trip, unlike CSV which stores everything as text.

Key points

Engineer features with .dt accessors, arithmetic, and pd.cut for banding.
pd.get_dummies one-hot encodes categories into numeric columns models can use.
Save cleaned data once — Parquet preserves dtypes and is far smaller/faster than CSV.

★ Hands-on Project — Build a Clean Sales Dataset

Run a full mini-wrangling pipeline on a realistic, messy dataset and produce a tidy, model-ready table plus a short data-quality report.

Create or download a CSV of at least 500 sales rows with columns date, region, product, amount — deliberately include missing values, duplicates, inconsistent text casing, and a few amounts stored as text.
Load it with pd.read_csv(..., parse_dates=['date']) and inspect with info(), describe() and isna().sum().
Clean it: drop or impute missing values (justify your choice), remove duplicates, coerce amount to numeric, and standardise region text with .str.
Create a second small products.csv (product, category, unit_price) and merge it on product with a left join.
Engineer features: a month column, an is_weekend flag, and a size band with pd.cut.
Use groupby to produce a region-level summary (sum, mean, count) and a pivot_table of region × product totals.
Write a short markdown data-quality note: how many rows you started/ended with, what you fixed, and any assumptions.
Save the cleaned dataset to both CSV and Parquet, then commit the notebook and outputs to your GitHub portfolio repo.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.

← Previous

Scientific Python & Mathematical Foundations

Exploratory Data Analysis & Visualisation