Exploratory data analysis (EDA) is detective work. Before you model or chart anything for an audience, you interrogate the data: What does it look like? What is typical? What is weird? What moves together? Done well, EDA finds the story and stops you from confidently reporting nonsense.
1. Descriptive statistics: the first look
One method, describe(), gives you a fast statistical x-ray of every numeric column.
import pandas as pd
df = pd.read_csv('orders.csv')
print(df.describe().round(1))
| | amount | quantity | discount |
|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 |
| mean | 1284.6 | 3.2 | 0.11 |
| std | 612.3 | 1.8 | 0.09 |
| min | 120.0 | 1.0 | 0.00 |
| 50% | 1180.0 | 3.0 | 0.10 |
| max | 9800.0 | 12.0 | 0.45 |
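By default `describe()` reports the quartiles and skips text columns. The sketch below shows two variations that are often useful on a first look: extra percentiles for the tails, and a summary of a categorical column. The rows are made up for illustration and only borrow the lesson's column names; they are not the contents of `orders.csv`.

```python
import pandas as pd

# Made-up rows standing in for orders.csv (hypothetical values).
orders = pd.DataFrame({
    "amount": [120.0, 650.0, 1180.0, 1500.0, 9800.0],
    "quantity": [1, 2, 3, 4, 12],
    "segment": ["Consumer", "Consumer", "Corporate", "Home Office", "Consumer"],
})

# Ask for extra percentiles: the 5th and 95th describe the tails.
numeric_summary = orders.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
print(numeric_summary.round(1))

# On a text column, describe() reports count, unique, top and freq instead.
text_summary = orders["segment"].describe()
print(text_summary)
```

The text summary answers "how many distinct values, and which is most common?", which is the categorical counterpart of centre and spread.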
Centre, spread & shape
- Centre: mean (average) vs median (the 50% middle value).
- Spread: std (standard deviation), a measure of how widely values vary.
- Shape: skew (is the data lop-sided?).
print('Mean :', round(df['amount'].mean(), 1))
print('Median:', df['amount'].median())
print('Skew :', round(df['amount'].skew(), 2))
Mean : 1284.6
Median: 1180.0
Skew : 1.87
- df.describe() summarises count, mean, std, min, quartiles and max at a glance.
- Compare mean vs median to spot skew; report the median for typical values in skewed data.
- Centre, spread and shape are the three questions every column should answer.
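To see why the median is the safer "typical" value in skewed data, here is a small sketch with made-up order amounts (not taken from the lesson's dataset). It compares the same statistics before and after one very large order is added.

```python
import pandas as pd

# Five ordinary orders, then the same orders plus one very large one (made-up values).
typical = pd.Series([1000, 1100, 1150, 1200, 1250])
with_big_order = pd.concat([typical, pd.Series([9800])], ignore_index=True)

for label, s in [("without big order", typical), ("with big order", with_big_order)]:
    print(
        label,
        "| mean:", round(s.mean(), 1),
        "| median:", s.median(),
        "| skew:", round(s.skew(), 2),
    )
```

A single extreme value more than doubles the mean, while the median moves by only 25. A mean well above the median is a quick sign of right skew.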
2. Reading distributions: histograms & boxplots
Numbers summarise; pictures reveal. A histogram shows the shape of a single variable; a boxplot shows its spread and outliers.
import matplotlib.pyplot as plt
df['amount'].plot(kind='hist', bins=20, edgecolor='white')
plt.title('Distribution of order amounts')
plt.xlabel('Amount'); plt.show()
A boxplot draws the median (line), the middle 50% (the box, from Q1 to Q3), the “whiskers”, and dots for outliers.
- Histograms reveal a variable's shape (symmetric, skewed, multi-peaked).
- Boxplots show the median, the interquartile box (Q1–Q3), whiskers and outliers.
- Always look at distributions before trusting a single summary number.
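The lesson shows code for the histogram but not for the boxplot, so here is a minimal sketch using matplotlib's `Axes.boxplot`. The amounts are made up, and the `Agg` backend and output file name are assumptions chosen so the script runs without a display; in a notebook you would normally just call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up order amounts with one very large order.
amounts = pd.Series([120, 400, 650, 900, 1100, 1180, 1300, 1500, 1900, 9800], name="amount")

fig, ax = plt.subplots()
parts = ax.boxplot(amounts.to_numpy())  # returns the drawn pieces: boxes, medians, whiskers, fliers
ax.set_title("Order amounts")
ax.set_ylabel("Amount")
fig.savefig("amount_boxplot.png")  # hypothetical file name

# The dots ("fliers") are the points beyond the whiskers; the median is the line in the box.
outlier_values = parts["fliers"][0].get_ydata()
median_value = parts["medians"][0].get_ydata()[0]
print("Median line at:", median_value)
print("Points drawn as outliers:", list(outlier_values))
plt.close(fig)
```

By default matplotlib places the whiskers at the furthest data points within 1.5×IQR of the box, so the dots it draws match the IQR rule for outliers.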
3. Detecting outliers: IQR & Z-score
Outliers can be gold (fraud, VIP customers) or garbage (typos, sensor errors). Two standard methods flag them objectively.
The IQR method (robust, distribution-free)
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
low = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
outliers = df[(df['amount'] < low) | (df['amount'] > high)]
print('Bounds:', round(low, 1), 'to', round(high, 1))
print('Outliers found:', len(outliers))
Bounds: -393.0 to 2877.0
Outliers found: 23
The Z-score method (for roughly normal data)
from scipy import stats
z = stats.zscore(df['amount'])
extreme = df[abs(z) > 3] # more than 3 std devs from the mean
print('Extreme values (|z| > 3):', len(extreme))
Extreme values (|z| > 3): 9
- IQR method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR; robust to skew.
- Z-score method: flag |z| > 3; best for roughly normal data.
- Outliers are clues, not trash: investigate before removing anything.
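The two methods can disagree, especially on small or skewed samples. The sketch below runs both on ten made-up amounts. `iqr_bounds` is a hypothetical helper, and the z-score is computed directly with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Ten made-up order amounts; 9800 is the obvious odd one out.
amounts = pd.Series([120, 400, 650, 900, 1100, 1180, 1300, 1500, 1900, 9800])

def iqr_bounds(s, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(amounts)
iqr_flagged = amounts[(amounts < low) | (amounts > high)]

# Z-score by hand: distance from the mean in units of standard deviation.
z = (amounts - amounts.mean()) / amounts.std(ddof=0)
z_flagged = amounts[z.abs() > 3]

print("IQR fences:", low, "to", high)
print("Flagged by IQR:", iqr_flagged.tolist())
print("Flagged by |z| > 3:", z_flagged.tolist())
```

Here the IQR rule flags 9800 but the z-score rule flags nothing: the extreme value inflates both the mean and the standard deviation, so its own z-score stays just under 3. This masking effect is one reason the IQR method is described as robust.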
4. Correlation analysis & heatmaps
Which variables move together? corr() gives a correlation matrix (values from −1 to +1); a heatmap makes it readable at a glance.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df[['amount', 'discount', 'profit']].corr()
print(corr.round(2))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.show()
- df.corr() measures linear relationships, from −1 (opposite) to +1 (together).
- A heatmap turns the correlation matrix into an at-a-glance picture.
- Correlation ≠ causation — always look for confounding factors.
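Because `corr()` measures linear relationships by default (Pearson), a coefficient near zero does not prove two variables are unrelated. The sketch below uses made-up columns to show a perfectly linear relationship next to a perfectly predictable but curved one.

```python
import pandas as pd

# x runs symmetrically around zero; "linear" and "curved" are both exact functions of x.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3])
demo = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,   # straight line: Pearson correlation with x is 1
    "curved": x ** 2,      # U-shape: fully determined by x, yet no linear trend
})

corr = demo.corr()
print(corr.round(2))
```

The `x` vs `curved` coefficient comes out at zero even though `curved` depends entirely on `x`, which is why a scatter plot is worth a look before reading too much into the matrix.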
5. Bivariate & group comparisons
Single-variable summaries are step one. The insights usually live in relationships — how one variable changes across the values of another.
Compare a metric across groups
# average profit by customer segment
by_segment = df.groupby('segment')['profit'].agg(['mean', 'count']).round(1)
print(by_segment)
| segment | mean | count |
|---|---|---|
| Consumer | 118.4 | 262 |
| Corporate | 171.9 | 152 |
| Home Office | 142.1 | 86 |
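Group means inherit the same weakness as any mean: one unusual order can distort a whole group. A possible refinement, sketched here on made-up data, is to report the median alongside the mean using pandas named aggregation.

```python
import pandas as pd

# Made-up profits: Consumer has one unusually profitable order.
orders = pd.DataFrame({
    "segment": ["Consumer"] * 4 + ["Corporate"] * 3,
    "profit": [50, 60, 70, 500, 150, 160, 170],
})

# Named aggregation: output column name = aggregation to apply.
by_segment = orders.groupby("segment")["profit"].agg(
    mean="mean",
    median="median",
    count="count",
)
print(by_segment)
```

In this toy data Consumer has the higher mean, yet the higher median belongs to Corporate, so the ranking depends on which measure of centre you report. Showing the count as well makes small groups easy to spot.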
Cross-tab two categories
# how many orders per region x category?
print(pd.crosstab(df['region'], df['category']))
| region | Furniture | Office | Tech |
|---|---|---|---|
| East | 41 | 88 | 63 |
| North | 55 | 102 | 77 |
| South | 38 | 71 | 49 |
A scatter plot of discount vs profit (with a trend line) tells the story instantly.
- groupby(cat)[metric].agg(...) compares a number across categories.
- pd.crosstab() counts the relationship between two categorical variables.
- Choose the comparison by data type: scatter (number vs number), box/bar (number vs category).
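For the number-vs-number case, the trend line is a straight-line fit through the scatter. A minimal sketch, assuming made-up discount and profit values and using `numpy.polyfit` with degree 1 as one common way to get the line:

```python
import numpy as np
import pandas as pd

# Made-up orders: deeper discounts come with lower profit.
orders = pd.DataFrame({
    "discount": [0.00, 0.05, 0.10, 0.15, 0.20, 0.30],
    "profit": [200, 180, 150, 130, 90, 20],
})

# Degree-1 fit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(orders["discount"], orders["profit"], deg=1)

# Trend value for each order, ready to draw over a scatter plot.
orders["trend"] = intercept + slope * orders["discount"]

print(f"Profit changes by about {slope * 0.01:.1f} per extra percentage point of discount")
print(orders)
```

To draw it, plot `discount` against `profit` as a scatter and add `discount` against `trend` as a line on the same axes. The sign of the slope gives the direction of the relationship; its size gives the rate.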
6. Automated profiling & communicating findings
For a fast, thorough first pass, an automated profiler generates a full EDA report — distributions, missing values, correlations and warnings — in a couple of lines.
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Orders EDA')
profile.to_file('orders_eda.html') # an interactive HTML report
Turn findings into insight
An analyst is paid for insight, not output. Frame every finding in four parts:
| Part | Example |
|---|---|
| Question | Why is profit falling in the South? |
| Finding | South gives the deepest discounts (avg 18% vs 9%). |
| Evidence | Discount↔profit correlation = −0.55; group means confirm it. |
| Recommendation | Cap South discounts at 12% and re-measure profit next quarter. |
- ydata-profiling auto-generates a full EDA report for a fast first pass.
- Automation surfaces facts; judgement decides which matter, so keep the thinking human.
- Communicate insight as Question → Finding → Evidence → Recommendation.
★ Hands-on Project — Full EDA Report
Run a complete, documented EDA on a real dataset (HR attrition, retail Superstore, or any open dataset) and surface at least five actionable insights backed by evidence.
- Load the dataset and profile it: df.shape, df.info(), df.describe(), df.isna().sum().
- For 3–4 key numeric columns, plot a histogram and a boxplot and describe each distribution's shape.
- Detect outliers with the IQR method; investigate (do not auto-delete) the most extreme ones and note what they are.
- Build a correlation matrix and heatmap; identify the two strongest relationships and one surprising one.
- Compare a core metric (e.g. profit) across at least two categories using groupby and a crosstab.
- Run an automated profile with ydata-profiling and skim it for anything you missed.
- Write up at least five insights in the Question → Finding → Evidence → Recommendation format.
- Save your notebook with charts and narrative, and push it to GitHub as a portfolio piece.
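As a starting point for the first project step, the checks listed above can be gathered into one table. `quick_profile` is a hypothetical helper written for this sketch, and the sample frame is made up; swap in your own dataset.

```python
import pandas as pd

def quick_profile(df):
    """One row per column: dtype, missing count, missing share (%), distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Made-up sample with a missing value in each column.
sample = pd.DataFrame({
    "amount": [120.0, None, 1180.0, 9800.0],
    "segment": ["Consumer", "Corporate", None, "Consumer"],
})

print("Rows, columns:", sample.shape)
profile = quick_profile(sample)
print(profile)
```

Columns with a high missing share or a single distinct value are worth noting before you plot anything, since they affect which later steps are meaningful.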