🔍 Module 4

Exploratory Data Analysis (EDA)

⏱ 12 hours · Intermediate · 6 topics

🎯 By the end: profile any dataset with descriptive statistics, read distributions, detect outliers with the IQR and Z-score methods, measure correlation, compare groups, and turn findings into clear, evidence-backed business insights.

EDA is detective work. Before you model or chart anything for an audience, you interrogate the data: What does it look like? What is typical? What is weird? What moves together? Done well, EDA finds the story — and stops you from confidently reporting nonsense.

1. Descriptive statistics: the first look

One method, describe(), gives you a fast statistical x-ray of every numeric column.

import pandas as pd
df = pd.read_csv('orders.csv')
print(df.describe().round(2))

▶ Output (25% and 75% rows omitted)

	amount	quantity	discount
count	500.0	500.0	500.0
mean	1284.6	3.2	0.11
std	612.3	1.8	0.09
min	120.0	1.0	0.00
50%	1180.0	3.0	0.10
max	9800.0	12.0	0.45

Centre, spread & shape

Centre: mean (average) vs median (the 50% middle value).
Spread: std (standard deviation) — how widely values vary.
Shape: skew — is the data lop-sided?

print('Mean  :', round(df['amount'].mean(), 1))
print('Median:', df['amount'].median())
print('Skew  :', round(df['amount'].skew(), 2))

▶ Output

Mean  : 1284.6
Median: 1180.0
Skew  : 1.87

A mean above the median, combined with positive skew, tells you a few very large orders are pulling the average up. For a “typical” value here, the median is more honest than the mean.
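To see why the median is the steadier of the two, here is a minimal sketch on made-up numbers (the ten amounts below are hypothetical, not taken from orders.csv). It compares both statistics with and without a single bulk order.

```python
import pandas as pd

# Hypothetical order amounts: nine ordinary orders plus one bulk purchase.
amounts = pd.Series([900, 950, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 9800])

mean_all = amounts.mean()
median_all = amounts.median()

# Drop the single largest order to see how each statistic reacts.
trimmed = amounts.drop(amounts.idxmax())
mean_trimmed = trimmed.mean()
median_trimmed = trimmed.median()

print("With bulk order    -> mean:", mean_all, "median:", median_all)
print("Without bulk order -> mean:", mean_trimmed, "median:", median_trimmed)
print("Skew with bulk order:", round(amounts.skew(), 2))
```

In this toy sample, removing one order moves the mean from 1970 to 1100 but the median only from 1125 to 1100, which is the behaviour the mean-versus-median comparison is meant to expose.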

Key points

df.describe() summarises count, mean, std, min, quartiles and max at a glance.
Compare mean vs median to spot skew; report the median for typical values in skewed data.
Centre, spread and shape are the three questions every column should answer.

2. Reading distributions: histograms & boxplots

Numbers summarise; pictures reveal. A histogram shows the shape of a single variable; a boxplot shows its spread and outliers.

import matplotlib.pyplot as plt

df['amount'].plot(kind='hist', bins=20, edgecolor='white')
plt.title('Distribution of order amounts')
plt.xlabel('Amount'); plt.show()

A right-skewed distribution — most orders are small, a long tail of large ones.

A boxplot draws the median (line), the middle 50% (the box, from Q1 to Q3), the “whiskers”, and dots for outliers.

The box is the middle 50% of the data; the red dot is a flagged outlier.
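The parts of a boxplot can be computed by hand, which makes the picture easier to read. The sketch below uses hypothetical amounts and assumes the common 1.5 × IQR whisker rule (matplotlib's default); other tools may draw whiskers differently.

```python
import pandas as pd

# Hypothetical order amounts; the last value is deliberately extreme.
amounts = pd.Series([400, 600, 800, 1000, 1200, 1400, 1600, 1800, 5000])

q1, median, q3 = amounts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences.
inside = amounts[(amounts >= lower_fence) & (amounts <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier dot.
outlier_dots = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print("Box     :", q1, "to", q3, "(median", median, ")")
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outlier_dots.tolist())
```

Note that the whiskers end at real data points (400 and 1800 here), not at the fences themselves.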

Pick the right picture: use a histogram to judge shape and modes, a boxplot to compare groups and spot outliers fast.

Key points

Histograms reveal a variable's shape (symmetric, skewed, multi-peaked).
Boxplots show the median, the interquartile box (Q1–Q3), whiskers and outliers.
Always look at distributions before trusting a single summary number.

3. Detecting outliers: IQR & Z-score

Outliers can be gold (fraud, VIP customers) or garbage (typos, sensor errors). Two standard methods flag them objectively.

The IQR method (robust, distribution-free)

Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1

low  = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
outliers = df[(df['amount'] < low) | (df['amount'] > high)]

print('Bounds:', round(low, 1), 'to', round(high, 1))
print('Outliers found:', len(outliers))

▶ Output

Bounds: -393.0 to 2877.0
Outliers found: 23

The Z-score method (for roughly normal data)

from scipy import stats

z = stats.zscore(df['amount'])
extreme = df[abs(z) > 3]      # more than 3 std devs from the mean
print('Extreme values (|z| > 3):', len(extreme))

▶ Output

Extreme values (|z| > 3): 9
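The two methods rarely agree on skewed data. One likely reason is that extreme values inflate the mean and standard deviation they are measured against, while the quartiles barely move. The sketch below illustrates this on a small hypothetical sample; the z-score is computed by hand with ddof=0, which matches the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical amounts: 17 ordinary orders plus three large ones.
amounts = pd.Series([1000] * 5 + [1100] * 6 + [1200] * 6 + [4000, 6000, 9800])

# IQR rule: the quartiles ignore how far the extremes sit from the centre.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
iqr_flags = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]

# Z-score rule: the extremes inflate the mean and std they are judged against.
z = (amounts - amounts.mean()) / amounts.std(ddof=0)
z_flags = amounts[z.abs() > 3]

print("IQR flags    :", iqr_flags.tolist())
print("Z-score flags:", z_flags.tolist())
```

In this sample the IQR rule flags all three large orders, while the z-score rule flags only 9800, because the large orders themselves stretch the standard deviation.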

Never delete outliers automatically. Investigate each one. A ₹9,800 order may be a real bulk purchase, not an error. Removing real extremes hides the most interesting part of your data.
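One way to follow this advice is to flag outliers in a new column and build a review list, leaving every row in place. This is a sketch under assumptions: the order_id column and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical orders table; column names and values are illustrative.
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "amount": [950, 1100, 1200, 1050, 1300, 1150, 9800, 1000],
})

q1 = orders["amount"].quantile(0.25)
q3 = orders["amount"].quantile(0.75)
iqr = q3 - q1

# Mark outliers in a new column; no rows are dropped.
orders["is_outlier"] = ~orders["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Review list for manual investigation, largest first.
to_review = orders[orders["is_outlier"]].sort_values("amount", ascending=False)
print(to_review)
```

Keeping the flag as a column means later analysis can be run both with and without the flagged rows, so the effect of any removal stays visible.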

Key points

IQR method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. The quartiles are barely moved by extreme values, so the bounds stay stable.
Z-score method: flag |z| > 3 — best for roughly normal data.
Outliers are clues, not trash — investigate before removing anything.

4. Correlation analysis & heatmaps

Which variables move together? corr() gives a correlation matrix (values from −1 to +1); a heatmap makes it readable at a glance.

import seaborn as sns
import matplotlib.pyplot as plt

corr = df[['amount', 'discount', 'profit']].corr()
print(corr.round(2))

sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.show()

Amount & profit rise together (0.76); discount pulls profit down (−0.55).
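The default corr() measures straight-line (Pearson) association, so a strong but curved relationship can score low. The sketch below uses constructed data to show this, and uses correlation of ranks (equivalent to Spearman's rank correlation) as one possible cross-check.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(-5, 6), dtype=float)   # -5, -4, ..., 5

# A perfect but U-shaped relationship: Pearson sees no linear trend.
u_shaped = x ** 2
pearson_u = x.corr(u_shaped)

# A perfect, always-increasing but curved relationship.
curved = np.exp(x)
pearson_curved = x.corr(curved)

# Correlating the ranks gives Spearman's rank correlation.
rank_curved = x.rank().corr(curved.rank())

print("U-shaped, Pearson:", round(pearson_u, 2))
print("Curved, Pearson  :", round(pearson_curved, 2))
print("Curved, ranks    :", round(rank_curved, 2))
```

A near-zero coefficient therefore means "no linear relationship", not "no relationship", which is one more reason to look at a scatter plot alongside the matrix.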

Correlation is not causation. Ice-cream sales and drowning deaths correlate — because both rise in summer, not because one causes the other. A correlation is a clue to investigate, never proof.
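The ice-cream example can be simulated. In the sketch below all numbers are invented: temperature drives both series, and neither affects the other. Removing the temperature effect with a simple linear fit (an illustrative choice, not the only way to adjust for a confounder) makes the correlation largely disappear.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # fixed seed so the sketch is repeatable
n = 1000

# Simulated (not real) daily data: temperature drives both series.
temperature = rng.normal(25, 5, n)
ice_cream = 20 * temperature + rng.normal(0, 40, n)
drownings = 0.3 * temperature + rng.normal(0, 0.6, n)

sim = pd.DataFrame({"temperature": temperature,
                    "ice_cream": ice_cream,
                    "drownings": drownings})

raw_corr = sim["ice_cream"].corr(sim["drownings"])

def residual(y, x):
    """Return the part of y not explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what is left of each series after removing the temperature effect.
adjusted_corr = residual(sim["ice_cream"], sim["temperature"]).corr(
    residual(sim["drownings"], sim["temperature"]))

print("Raw correlation          :", round(raw_corr, 2))
print("After removing temperature:", round(adjusted_corr, 2))
```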

Key points

df.corr() measures linear (Pearson, by default) relationships between numeric columns, from −1 (move in opposite directions) to +1 (move together).
A heatmap turns the correlation matrix into an at-a-glance picture.
Correlation ≠ causation — always look for confounding factors.

5. Bivariate & group comparisons

Single-variable summaries are step one. The insights usually live in relationships — how one variable changes across the values of another.

Compare a metric across groups

# average profit by customer segment
by_segment = df.groupby('segment')['profit'].agg(['mean', 'count']).round(1)
print(by_segment)

▶ Output

segment	mean	count
Consumer	118.4	262
Corporate	171.9	152
Home Office	142.1	86

Cross-tab two categories

# how many orders per region x category?
print(pd.crosstab(df['region'], df['category']))

▶ Output

category  Furniture  Office  Tech
region
East             41      88    63
North            55     102    77
South            38      71    49
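Raw counts are hard to compare when regions differ in size. One option is the normalize argument of pd.crosstab, which converts counts to proportions. The sketch below uses a tiny hypothetical table that reuses the region and category names from the example.

```python
import pandas as pd

# Hypothetical orders; only the column names follow the example above.
orders = pd.DataFrame({
    "region":   ["East", "East", "East", "East",
                 "North", "North", "North", "North", "North", "North"],
    "category": ["Tech", "Office", "Office", "Furniture",
                 "Tech", "Tech", "Tech", "Office", "Office", "Furniture"],
})

counts = pd.crosstab(orders["region"], orders["category"])

# normalize="index" converts each row to proportions that sum to 1.
shares = pd.crosstab(orders["region"], orders["category"], normalize="index")

print(counts)
print((shares * 100).round(1))   # category mix per region, in percent
```

Here Tech is 25% of East's orders but 50% of North's, a difference that is easy to miss in the raw counts.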

Scatter for two numbers, box/bar for number-by-category. To see if discount erodes profit, a scatter of discount vs profit (with a trend line) tells the story instantly.
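A scatter with a trend line might be sketched as follows. The data is simulated, the straight-line fit via np.polyfit is one simple choice of trend line, and the output file name is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Simulated orders: profit falls as discount rises, plus noise.
discount = rng.uniform(0.0, 0.45, 200)
profit = 250 - 400 * discount + rng.normal(0, 40, 200)
orders = pd.DataFrame({"discount": discount, "profit": profit})

# Least-squares straight line: profit ~ slope * discount + intercept.
slope, intercept = np.polyfit(orders["discount"], orders["profit"], 1)

fig, ax = plt.subplots()
ax.scatter(orders["discount"], orders["profit"], alpha=0.5)
xs = np.linspace(orders["discount"].min(), orders["discount"].max(), 50)
ax.plot(xs, slope * xs + intercept, color="red", label=f"trend (slope {slope:.0f})")
ax.set_xlabel("Discount")
ax.set_ylabel("Profit")
ax.legend()
fig.savefig("discount_vs_profit.png")   # or plt.show() in a notebook
```

A downward-sloping line supports the "discount erodes profit" reading, but the slope alone does not rule out a confounder such as region or product mix.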

Key points

groupby(cat)[metric].agg(...) compares a number across categories.
pd.crosstab() counts the relationship between two categorical variables.
Choose the comparison by data type: scatter (number vs number), box/bar (number vs category).

6. Automated profiling & communicating findings

For a fast, thorough first pass, an automated profiler generates a full EDA report — distributions, missing values, correlations and warnings — in a couple of lines.

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title='Orders EDA')
profile.to_file('orders_eda.html')   # an interactive HTML report

Automate the boring part, not the thinking. The profiler surfaces facts fast, but only you can decide which facts matter to the business question. Treat its output as a starting map, not the destination.

Turn findings into insight

An analyst is paid for insight, not output. Frame every finding in four parts:

Part	Example
Question	Why is profit falling in the South?
Finding	South gives the deepest discounts (avg 18% vs 9%).
Evidence	Discount↔profit correlation = −0.55; group means confirm it.
Recommendation	Cap South discounts at 12% and re-measure profit next quarter.

The insight test: a finding is only useful if someone could act on it. Always end EDA with “so what should we do?” — that is what turns an analyst into a trusted advisor.

Key points

ydata-profiling auto-generates a full EDA report for a fast first pass.
Automation surfaces facts; judgement decides which matter — keep the thinking human.
Communicate insight as Question → Finding → Evidence → Recommendation.

★ Hands-on Project — Full EDA Report

Run a complete, documented EDA on a real dataset (HR attrition, retail Superstore, or any open dataset) and surface at least five actionable insights backed by evidence.

Load the dataset and profile it: df.shape, df.info(), df.describe(), df.isna().sum().
For 3–4 key numeric columns, plot a histogram and a boxplot and describe each distribution's shape.
Detect outliers with the IQR method; investigate (do not auto-delete) the most extreme ones and note what they are.
Build a correlation matrix and heatmap; identify the two strongest relationships and one surprising one.
Compare a core metric (e.g. profit) across at least two categories using groupby and a crosstab.
Run an automated profile with ydata-profiling and skim it for anything you missed.
Write up at least five insights in the Question → Finding → Evidence → Recommendation format.
Save your notebook with charts and narrative, and push it to GitHub as a portfolio piece.
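The first step of the project can be sketched as below. The small DataFrame is a hypothetical stand-in so the snippet runs on its own; in the project it would be replaced by the dataset you load.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset; replace with pd.read_csv(...) on the real file.
df = pd.DataFrame({
    "segment": ["Consumer", "Corporate", "Consumer", "Home Office", "Corporate"],
    "amount": [1200.0, 850.0, np.nan, 430.0, 9800.0],
    "quantity": [3, 2, 1, 1, 12],
})

print(df.shape)             # (rows, columns)
df.info()                   # dtypes and non-null counts (prints directly)
print(df.describe())        # summary of the numeric columns
missing = df.isna().sum()   # missing values per column
print(missing)
```

Checking missing values before anything else matters because describe() silently skips them, so a count below the row total is the first hint that a column has gaps.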

