Before any model, a good data scientist looks. Exploratory Data Analysis (EDA) is the disciplined habit of getting to know a dataset — its shape, its quirks, its surprises — before assuming anything. It is part detective work, part storytelling. Done well, EDA catches the broken column that would have wrecked your model, surfaces the relationship worth modelling, and tells you which questions are even answerable. This module pairs the numbers with the pictures: Matplotlib for control, Seaborn for fast statistical plots, and Plotly for interactive charts.
1. The EDA mindset & univariate analysis
EDA starts one variable at a time — univariate analysis. For each column you ask: what is its type, its centre, its spread, its shape, and does anything look wrong?
Summarise everything at once
import seaborn as sns
import pandas as pd
tips = sns.load_dataset('tips') # a classic teaching dataset
print(tips.describe().round(2))
| | total_bill | tip | size |
|---|---|---|---|
| count | 244.00 | 244.00 | 244.00 |
| mean | 19.79 | 3.00 | 2.57 |
| std | 8.90 | 1.38 | 0.95 |
| min | 3.07 | 1.00 | 1.00 |
| 50% | 17.80 | 2.90 | 2.00 |
| max | 50.81 | 10.00 | 6.00 |
Measure the shape: skew
print('Bill skew:', round(tips['total_bill'].skew(), 2))
print('Tip skew :', round(tips['tip'].skew(), 2))
# Category frequencies
print(tips['day'].value_counts())
Bill skew: 1.13
Tip skew : 1.47
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64
- EDA examines one variable at a time first: type, centre, spread, shape, anomalies.
- describe() summarises numerics; value_counts() summarises categories.
- skew() quantifies asymmetry: positive skew means a long right tail and a mean pulled above the median.
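To see why positive skew goes with an inflated mean, here is a minimal sketch on a small made-up series (the values are illustrative and not taken from tips). It compares the mean with the median and reports the skew:

```python
import pandas as pd

# Hypothetical bills: mostly small, with one very large value in the right tail
bills = pd.Series([8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 60.0])

mean = bills.mean()
median = bills.median()
skew = bills.skew()  # sample skewness; a positive value means a long right tail

print(f"mean={mean:.2f}  median={median:.2f}  skew={skew:.2f}")

# The single large bill drags the mean above the median
print("mean exceeds median:", bool(mean > median))
```

When the mean and median disagree this much, the median is usually the safer description of a typical value.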
2. Plotting fundamentals with Matplotlib
Matplotlib is the foundation every Python chart sits on. The professional pattern is figure + axes: create a canvas, draw onto named axes, and label everything.
The figure/axes pattern
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(tips['total_bill'], bins=20, color='#0e7490', edgecolor='white')
ax.set_title('Distribution of total bill')
ax.set_xlabel('Total bill ($)')
ax.set_ylabel('Number of tables')
fig.tight_layout()
plt.show()
Bar chart of a category
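# observed=True keeps only the days that occur and avoids the pandas FutureWarning for categorical keys
by_day = tips.groupby('day', observed=True)['total_bill'].mean().round(2)
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(by_day.index, by_day.values, color='#FF6B00')
ax.set_xlabel('Day')
ax.set_ylabel('Mean bill ($)')
ax.set_title('Average bill by day')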
plt.show()
The figure/axes pattern (rather than the plt.plot shortcut) scales to multi-panel figures and keeps your code readable.
- Use the figure/axes pattern: fig, ax = plt.subplots(), then draw and label on ax.
- ax.hist, ax.bar, ax.plot, ax.scatter cover most needs.
- Every chart needs a title and labelled axes; never ship an unlabelled plot.
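As an illustration of how the figure/axes pattern extends to several panels, the sketch below draws a histogram and a scatter plot side by side. The data values and the variable names (ax_hist, ax_scatter) are made up for the example, and the Agg backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only
bills = [12.5, 18.0, 21.3, 9.8, 30.1, 16.4, 24.9, 14.2]
tips_paid = [2.0, 3.0, 3.5, 1.5, 5.0, 2.5, 4.0, 2.2]

# One figure, two named axes side by side
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: distribution of one variable
ax_hist.hist(bills, bins=5, color='#0e7490', edgecolor='white')
ax_hist.set_title('Distribution of total bill')
ax_hist.set_xlabel('Total bill ($)')
ax_hist.set_ylabel('Number of tables')

# Right panel: relationship between two variables
ax_scatter.scatter(bills, tips_paid, color='#FF6B00')
ax_scatter.set_title('Tip vs total bill')
ax_scatter.set_xlabel('Total bill ($)')
ax_scatter.set_ylabel('Tip ($)')

fig.tight_layout()
```

In a notebook you would normally drop the backend line and finish with plt.show(). Each panel is labelled through its own axes object, which is what keeps multi-panel code readable.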
3. Fast statistical plots with Seaborn
Seaborn sits on top of Matplotlib and makes statistical charts beautiful in one line. It understands DataFrames, so you pass column names directly.
Distribution with a density curve
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style='whitegrid')
sns.histplot(data=tips, x='total_bill', kde=True, color='#0e7490')
plt.title('Total bill with density estimate')
plt.show()
Compare groups with a boxplot
# Does the bill differ between lunch and dinner?
sns.boxplot(data=tips, x='time', y='total_bill', hue='time', palette='Set2')
plt.title('Bill by service time')
plt.show()
- Seaborn makes statistical charts in one line and reads column names from a DataFrame.
- histplot(..., kde=True) overlays a smooth density on the histogram.
- Boxplots compare distributions across groups: box = IQR, line = median, whiskers = the most extreme points within 1.5×IQR of the box; points beyond the whiskers are drawn individually as outliers.
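The numbers behind a boxplot can be read straight from pandas. The following sketch uses a small made-up DataFrame (not the real tips data) to compute the quartiles that form each box and the upper limit for a whisker under the default 1.5×IQR rule:

```python
import pandas as pd

# Hypothetical bills, for illustration only
df = pd.DataFrame({
    'time': ['Lunch'] * 5 + ['Dinner'] * 5,
    'total_bill': [10.0, 12.0, 14.0, 16.0, 18.0,
                   15.0, 20.0, 25.0, 30.0, 60.0],
})

# 25%, 50% and 75% are the bottom edge, middle line and top edge of each box
summary = df.groupby('time')['total_bill'].describe()
print(summary[['25%', '50%', '75%']])

# Upper whisker limit per group: Q3 + 1.5 * IQR
iqr = summary['75%'] - summary['25%']
upper_fence = summary['75%'] + 1.5 * iqr
print(upper_fence)
```

Here the 60.0 dinner bill lies above the Dinner fence of 45.0, so a boxplot would draw it as a separate point instead of stretching the whisker to reach it.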
4. Relationships: scatter plots & correlation
Bivariate analysis asks how two variables move together. The scatter plot is the picture; the correlation coefficient is the number.
Scatter with a regression line
sns.regplot(data=tips, x='total_bill', y='tip',
scatter_kws={'alpha': 0.5}, line_kws={'color': '#FF6B00'})
plt.title('Tip vs total bill')
plt.show()
The correlation matrix & heatmap
corr = tips[['total_bill', 'tip', 'size']].corr().round(2)
print(corr)
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.show()
| | total_bill | tip | size |
|---|---|---|---|
| total_bill | 1.00 | 0.68 | 0.60 |
| tip | 0.68 | 1.00 | 0.49 |
| size | 0.60 | 0.49 | 1.00 |
- Scatter plots show pairwise relationships; regplot adds a trend line.
- df.corr() gives correlation coefficients from -1 to +1; a heatmap visualises them.
- Correlation measures association, never causation: always consider lurking variables.
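To make the coefficient less of a black box, this sketch computes Pearson's r by hand as covariance divided by the product of the two standard deviations, then checks it against pandas. The bill and tip pairs are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical bill/tip pairs, for illustration only
df = pd.DataFrame({
    'total_bill': [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    'tip':        [1.5, 2.0, 3.5, 3.0, 4.5, 5.0],
})

x = df['total_bill'].to_numpy()
y = df['tip'].to_numpy()

# Pearson r = cov(x, y) / (std_x * std_y), using sample (n - 1) statistics
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# pandas uses Pearson by default
r_pandas = df['total_bill'].corr(df['tip'])

print(round(r_manual, 4), round(r_pandas, 4))
```

Because r only captures straight-line association, it is worth looking at the scatter plot as well before trusting the number.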
5. Detecting outliers & skew
Outliers can be gold (fraud, a breakthrough) or garbage (a typo). EDA's job is to find them and decide deliberately what to do — never delete silently.
The IQR rule
A common, robust definition: a value is an outlier if it falls below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, where IQR = Q3 - Q1.
q1 = tips['total_bill'].quantile(0.25)
q3 = tips['total_bill'].quantile(0.75)
iqr = q3 - q1
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr
outliers = tips[(tips['total_bill'] < low) | (tips['total_bill'] > high)]
print(f'IQR bounds: {low:.2f} to {high:.2f}')
print('Outlier rows:', len(outliers))
IQR bounds: -2.82 to 40.30
Outlier rows: 9
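If you want to reuse the rule across columns, one possible shape is a small helper. The function name iqr_bounds, its parameter k and the sample values are all hypothetical. The sketch also shows capping with Series.clip as one alternative to deleting rows:

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical bills with one extreme value
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 95.0])

low, high = iqr_bounds(bills)
is_outlier = (bills < low) | (bills > high)

# Capping keeps every row but pulls extreme values back to the fences
capped = bills.clip(lower=low, upper=high)

print(f'fences: {low:.2f} to {high:.2f}')
print('flagged:', int(is_outlier.sum()))
print('max after capping:', capped.max())
```

Flagging first and transforming second keeps the decision visible: the is_outlier mask can be inspected before anything is changed.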
Taming skew with a log transform
import numpy as np
tips['log_bill'] = np.log1p(tips['total_bill']) # log(1 + x)
print('Before:', round(tips['total_bill'].skew(), 2))
print('After :', round(tips['log_bill'].skew(), 2))
Before: 1.13
After : -0.15
The log transform pulled the long right tail in: skew moved from 1.13 to -0.15, close to the 0 of a symmetric distribution. Many models prefer this near-symmetric shape.
- The IQR rule flags values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR.
- A log transform (np.log1p) reduces right skew and often helps models.
- Investigate outliers before removing them: they may be the most interesting data you have.
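One practical detail about np.log1p is that it is defined at zero and can be undone with np.expm1, so values can be reported in the original units. A brief sketch with made-up values:

```python
import numpy as np

# Hypothetical amounts, including a zero that plain log(x) could not handle
bills = np.array([0.0, 5.0, 20.0, 50.0, 400.0])

logged = np.log1p(bills)     # log(1 + x)
restored = np.expm1(logged)  # exp(y) - 1, the inverse of log1p

print(logged.round(2))
print('round trip ok:', bool(np.allclose(restored, bills)))
```

The transform preserves order, so rankings and medians carry over; only the spacing between large values is compressed.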
6. Interactive charts with Plotly & telling the story
Static charts explain; interactive charts invite exploration. Plotly Express builds hoverable, zoomable charts in one line — perfect for notebooks, dashboards and stakeholder demos.
An interactive scatter in one line
import plotly.express as px
fig = px.scatter(tips, x='total_bill', y='tip',
color='time', size='size',
hover_data=['day'],
title='Tips explorer (hover to inspect)')
fig.show() # fig.write_html('tips.html') to share
Principles of an honest, clear chart
- One message per chart. If you are explaining two things, make two charts.
- Start bar-chart axes at zero. A truncated axis exaggerates differences and misleads.
- Label directly where you can, and keep colour meaningful, not decorative.
- Choose the right chart: distribution → histogram/box; relationship → scatter; trend over time → line; parts of a whole → bar (rarely pie).
| Your question | Reach for |
|---|---|
| How is one variable distributed? | Histogram / boxplot |
| How do two numbers relate? | Scatter plot |
| How does a value change over time? | Line chart |
| How do categories compare? | Bar chart |
| How do many variables correlate? | Heatmap / pair plot |
- Plotly Express creates interactive (hover/zoom) charts in one line; export with write_html.
- Honest charts: one message each, zero-based bar axes, meaningful colour, direct labels.
- Match chart to question: distribution, relationship, trend, comparison, or correlation.
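The zero-baseline principle can be checked with simple arithmetic. The sketch below uses a hypothetical helper, drawn_ratio, and two made-up mean bills to compare how tall one bar looks relative to another when the axis starts at zero and when it is truncated:

```python
def drawn_ratio(a, b, axis_start=0.0):
    """Ratio of bar heights as drawn when the y-axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical mean bills for two days
fri, sat = 17.0, 20.0

print(round(drawn_ratio(fri, sat, axis_start=0.0), 2))   # → 1.18, the true ratio
print(round(drawn_ratio(fri, sat, axis_start=15.0), 2))  # → 2.5, axis truncated at 15
```

An 18% difference is drawn as a bar two and a half times as tall once the axis starts at 15, which is why truncated bar axes mislead.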
★ Hands-on Project — Full EDA on a Real Dataset
Choose a dataset and produce a complete exploratory analysis notebook that a stakeholder could read and trust.
- Pick a dataset (e.g. seaborn's tips, titanic, or a Kaggle CSV) and load it into pandas.
- Profile it: shape, info(), describe(), missing values, and value_counts() for each category.
- Univariate: plot the distribution of every numeric column (histogram + KDE) and report skew; bar-chart the key categories.
- Bivariate: build a correlation heatmap and at least two scatter plots of the strongest relationships, each with a trend line.
- Detect outliers with the IQR rule on the main numeric column and decide (with justification) whether to keep, cap or investigate them.
- Apply a log transform to one skewed column and show the before/after skew and histograms side by side.
- Make one interactive Plotly chart and export it to HTML.
- Write a short 'Five things I learned' markdown summary at the top of the notebook, then commit it to your portfolio.
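As a possible starting point for the profiling step, the sketch below gathers dtype, missing count, distinct values and skew into one table. The helper name profile and the demo DataFrame are hypothetical; swap in your own dataset.

```python
import numpy as np
import pandas as pd

def profile(df):
    """One row per column: dtype, missing count, distinct values, skew (numeric only)."""
    numeric = df.select_dtypes(include='number')
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),
        'skew': numeric.skew(),  # NaN for non-numeric columns
    })
    # Keep the original column order of the dataset
    return report.reindex(df.columns)

# Hypothetical stand-in for a real dataset
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, np.nan, 40.0],
    'day': ['Sat', 'Sun', 'Sat', None],
    'size': [2, 3, 2, 4],
})

report = profile(demo)
print('shape:', demo.shape)
print(report)
```

A table like this at the top of the notebook tells a stakeholder which columns have gaps before they read any chart.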