🔍 Module 3

Exploratory Data Analysis & Visualisation

⏱ 14 hours · Intermediate · 6 topics

🎯 By the end: explore a new dataset systematically, summarise and visualise single variables and relationships, detect outliers and skew, and build clear, honest charts with Matplotlib, Seaborn and Plotly.

Before any model, a good data scientist looks. Exploratory Data Analysis (EDA) is the disciplined habit of getting to know a dataset — its shape, its quirks, its surprises — before assuming anything. It is part detective work, part storytelling. Done well, EDA catches the broken column that would have wrecked your model, surfaces the relationship worth modelling, and tells you which questions are even answerable. This module pairs the numbers with the pictures: Matplotlib for control, Seaborn for fast statistical plots, and Plotly for interactive charts.

1. The EDA mindset & univariate analysis

EDA starts one variable at a time — univariate analysis. For each column you ask: what is its type, its centre, its spread, its shape, and does anything look wrong?
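Those per-column questions can be collected into one small helper. The sketch below is illustrative only: `profile_columns` is a hypothetical name, not a pandas function, and the `demo` table is made up rather than taken from the tips data.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: type, missing count, and (numerics only) centre, spread, shape."""
    rows = []
    for name in df.columns:
        col = df[name]
        row = {
            "column": name,
            "dtype": str(col.dtype),             # what is its type?
            "n_missing": int(col.isna().sum()),  # does anything look wrong?
            "n_unique": int(col.nunique()),
        }
        if pd.api.types.is_numeric_dtype(col):
            row.update(
                mean=col.mean(),      # centre
                median=col.median(),  # centre, robust to extremes
                std=col.std(),        # spread
                skew=col.skew(),      # shape
            )
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


# Made-up example table with one missing bill (an assumption for illustration).
demo = pd.DataFrame({
    "bill": [10.0, 12.5, 14.0, 15.5, None, 48.0],
    "day": ["Sat", "Sun", "Sat", "Fri", "Sat", "Sun"],
})
profile = profile_columns(demo)
print(profile)
```

Non-numeric columns get NaN in the numeric summary cells, which is a deliberate choice here: one table for every column, with gaps where a statistic does not apply.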

Summarise everything at once

import seaborn as sns
import pandas as pd

tips = sns.load_dataset('tips')      # a classic teaching dataset
print(tips.describe().round(2))

tips.describe() (abridged: the 25% and 75% rows are omitted)
	total_bill	tip	size
count	244.00	244.00	244.00
mean	19.79	3.00	2.57
std	8.90	1.38	0.95
min	3.07	1.00	1.00
50%	17.80	2.90	2.00
max	50.81	10.00	6.00

Measure the shape: skew

print('Bill skew:', round(tips['total_bill'].skew(), 2))
print('Tip skew :', round(tips['tip'].skew(), 2))

# Category frequencies
print(tips['day'].value_counts())

▶ Output

Bill skew: 1.13
Tip skew : 1.47
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

Read skew like a compass. Skew near 0 is roughly symmetric; positive skew (like 1.13 here) means a long right tail — a few big bills pull the mean above the median. That single number tells you the mean will overstate the “typical” bill.
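To see the mean being pulled by a tail, here is a minimal sketch on eight made-up bills (an assumption for illustration, not the tips data). One large value is enough to open a gap between mean and median.

```python
import pandas as pd

# Seven ordinary bills and one very large one (made-up numbers).
bills = pd.Series([10, 12, 13, 15, 16, 18, 20, 60], dtype=float)

mean_bill = bills.mean()      # 20.5
median_bill = bills.median()  # 15.5
skew_bill = bills.skew()      # positive: long right tail

# Drop the large bill and the mean falls back towards the median.
no_tail = bills[bills < 60]
gap_with_tail = mean_bill - median_bill
gap_without_tail = abs(no_tail.mean() - no_tail.median())

print(f"mean={mean_bill}, median={median_bill}, skew is positive: {skew_bill > 0}")
print(f"mean-median gap with tail: {gap_with_tail:.2f}, without: {gap_without_tail:.2f}")
```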

Key points

EDA examines one variable at a time first: type, centre, spread, shape, anomalies.
describe() summarises numerics; value_counts() summarises categories.
skew() quantifies asymmetry — positive skew means a long right tail and an inflated mean.

2. Plotting fundamentals with Matplotlib

Matplotlib is the foundation much of Python's plotting stack sits on; Seaborn and the pandas .plot() method both draw with it. The professional pattern is figure + axes: create a canvas, draw onto named axes, and label everything.

The figure/axes pattern

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(tips['total_bill'], bins=20, color='#0e7490', edgecolor='white')

ax.set_title('Distribution of total bill')
ax.set_xlabel('Total bill ($)')
ax.set_ylabel('Number of tables')
fig.tight_layout()
plt.show()

A histogram of total bills — clearly right-skewed, matching the skew of 1.13.

Bar chart of a category

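# 'day' is a categorical column: observed=True keeps only days that occur
# and avoids the FutureWarning recent pandas versions raise without it.
by_day = tips.groupby('day', observed=True)['total_bill'].mean().round(2)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(by_day.index.astype(str), by_day.values, color='#FF6B00')
ax.set_xlabel('Day')
ax.set_ylabel('Mean bill ($)')
ax.set_title('Average bill by day')
fig.tight_layout()
plt.show()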

Always label axes and titles. An unlabelled chart is a guessing game. The figure/axes pattern (over the quick plt.plot shortcut) scales to multi-panel figures and keeps your code readable.
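As a sketch of how the pattern scales to multi-panel figures: `plt.subplots(1, 2)` returns one figure and two axes, and each axes is drawn and labelled independently. The data here is synthetic (generated with NumPy, not the tips dataset), and the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed "bills" and loosely related "tips" (assumed data).
rng = np.random.default_rng(0)
bills = rng.gamma(shape=4.0, scale=5.0, size=200)
tip_amounts = 0.15 * bills + rng.normal(0, 0.8, size=200)

# One figure, two named axes side by side.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(bills, bins=20, color="#0e7490", edgecolor="white")
ax_hist.set_title("Distribution of total bill")
ax_hist.set_xlabel("Total bill ($)")
ax_hist.set_ylabel("Number of tables")

ax_scatter.scatter(bills, tip_amounts, alpha=0.5, color="#FF6B00")
ax_scatter.set_title("Tip vs total bill")
ax_scatter.set_xlabel("Total bill ($)")
ax_scatter.set_ylabel("Tip ($)")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call plt.show() afterwards.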

Key points

Use the figure/axes pattern: fig, ax = plt.subplots(), then draw and label on ax.
ax.hist, ax.bar, ax.plot, ax.scatter cover most needs.
Every chart needs a title and labelled axes — never ship an unlabelled plot.

3. Fast statistical plots with Seaborn

Seaborn sits on top of Matplotlib and makes statistical charts beautiful in one line. It understands DataFrames, so you pass column names directly.

Distribution with a density curve

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid')
sns.histplot(data=tips, x='total_bill', kde=True, color='#0e7490')
plt.title('Total bill with density estimate')
plt.show()

Compare groups with a boxplot

# Does the bill differ between lunch and dinner?
sns.boxplot(data=tips, x='time', y='total_bill', hue='time', palette='Set2')
plt.title('Bill by service time')
plt.show()

A boxplot shows the median (line inside the box), the middle 50% (box) and the reach of the non-outlier values (whiskers). Dinner bills run higher.

How to read a boxplot. The box spans the 25th–75th percentile (the IQR) and the line inside is the median. By default each whisker reaches the most extreme data point that lies within 1.5×IQR of the box, so whiskers do not show the full range. Points beyond the whiskers are drawn individually and are candidate outliers. One glance compares whole distributions across groups.
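The parts of a boxplot are just a handful of numbers, and computing them by hand shows what the picture encodes. This is a minimal sketch on nine made-up values, assuming the default whisker setting of 1.5×IQR used by Matplotlib and Seaborn.

```python
import numpy as np

# Made-up values with one obvious extreme (30).
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])  # box edges and centre line
iqr = q3 - q1

# Fences mark how far a whisker is allowed to reach.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point.
fliers = values[(values < low_fence) | (values > high_fence)]

print(f"box: {q1} to {q3}, median: {median}")              # box: 5.0 to 9.0, median: 7.0
print(f"whiskers: {whisker_low} to {whisker_high}")        # whiskers: 2.0 to 10.0
print(f"candidate outliers: {fliers.tolist()}")            # candidate outliers: [30.0]
```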

Key points

Seaborn makes statistical charts in one line and reads column names from a DataFrame.
histplot(..., kde=True) overlays a smooth density on the histogram.
Boxplots compare distributions across groups: box = IQR, line = median, whiskers = most extreme points within 1.5×IQR of the box.

4. Relationships: scatter plots & correlation

Bivariate analysis asks how two variables move together. The scatter plot is the picture; the correlation coefficient is the number.
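The number in question is usually the Pearson coefficient: the sum of the products of the two variables' deviations from their means, divided by the square root of the product of their summed squared deviations. As a sketch, the block below computes it by hand on six made-up pairs (not the tips data) and checks it against pandas' `Series.corr`.

```python
import numpy as np
import pandas as pd

# Made-up bill/tip pairs for illustration.
x = pd.Series([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
y = pd.Series([1.5, 2.0, 3.5, 3.0, 5.0, 5.5])

# Deviations from each mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r from its definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses Pearson by default.
r_pandas = x.corr(y)

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```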

Scatter with a regression line

sns.regplot(data=tips, x='total_bill', y='tip',
            scatter_kws={'alpha': 0.5}, line_kws={'color': '#FF6B00'})
plt.title('Tip vs total bill')
plt.show()

Bigger bills tend to bring bigger tips — a clear positive relationship (r ≈ 0.68).

The correlation matrix & heatmap

corr = tips[['total_bill', 'tip', 'size']].corr().round(2)
print(corr)

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.show()

correlation matrix
	total_bill	tip	size
total_bill	1.00	0.68	0.60
tip	0.68	1.00	0.49
size	0.60	0.49	1.00

Correlation is not causation. A correlation of 0.68 says bill and tip move together — not that one causes the other. Both may be driven by a third factor (party size). Keep that discipline; it separates analysts from headline writers.
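A lurking variable is easy to simulate. In the sketch below everything is synthetic (standardised random numbers, not the tips data): a hidden driver called `party` feeds both `bill` and `tip`, which never influence each other directly. Regressing the driver out of both and correlating the residuals (a partial correlation) makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hidden common driver, plus independent noise on each variable.
party = rng.normal(size=n)
bill = party + rng.normal(scale=0.5, size=n)
tip = party + rng.normal(scale=0.5, size=n)


def residuals(target, driver):
    """What is left of `target` after a straight-line fit on `driver`."""
    slope, intercept = np.polyfit(driver, target, deg=1)
    return target - (slope * driver + intercept)


# Raw correlation looks strong...
r_raw = np.corrcoef(bill, tip)[0, 1]

# ...but it disappears once the common driver is accounted for.
r_partial = np.corrcoef(residuals(bill, party), residuals(tip, party))[0, 1]

print(f"raw r: {r_raw:.2f}, after removing the driver: {r_partial:.2f}")
```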

Key points

Scatter plots show pairwise relationships; regplot adds a trend line.
df.corr() gives Pearson correlation coefficients from -1 to +1 for numeric columns (select them first, or pass numeric_only=True); a heatmap visualises them.
Correlation measures association, never causation — always consider lurking variables.

5. Detecting outliers & skew

Outliers can be gold (fraud, a breakthrough) or garbage (a typo). EDA's job is to find them and decide deliberately what to do — never delete silently.

The IQR rule

A common, robust definition: a value is an outlier if it falls below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, where IQR = Q3 - Q1.

q1 = tips['total_bill'].quantile(0.25)
q3 = tips['total_bill'].quantile(0.75)
iqr = q3 - q1

low  = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

outliers = tips[(tips['total_bill'] < low) | (tips['total_bill'] > high)]
print(f'IQR bounds: {low:.2f} to {high:.2f}')
print('Outlier rows:', len(outliers))

▶ Output

IQR bounds: -2.82 to 40.30
Outlier rows: 9

Taming skew with a log transform

import numpy as np

tips['log_bill'] = np.log1p(tips['total_bill'])   # log(1 + x)
print('Before:', round(tips['total_bill'].skew(), 2))
print('After :', round(tips['log_bill'].skew(), 2))

▶ Output

Before: 1.13
After : -0.15

The log transform pulled the long right tail in: skew went from 1.13 to -0.15, close to symmetric. Models that are sensitive to extreme values, such as linear regression, often behave better on this shape.
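Two properties of `np.log1p` are worth checking for yourself: it is defined at zero, and `np.expm1` undoes it exactly, so results can be reported back in dollars. The sketch below uses synthetic lognormal "bills" (an assumption, not the tips data).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed bills (assumed data for illustration).
rng = np.random.default_rng(0)
bills = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))

log_bills = np.log1p(bills)     # log(1 + x), safe even when x == 0
restored = np.expm1(log_bills)  # exp(y) - 1, the exact inverse

skew_before = bills.skew()
skew_after = log_bills.skew()

print("log1p(0) =", np.log1p(0.0))
print("round trip recovers the original:", np.allclose(restored, bills))
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```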

Investigate before you remove. An outlier is a question, not an error. Check whether it is a real extreme value or a data-entry mistake; document your decision either way. Removing real data to make a chart prettier is how analyses go wrong.
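One way to decide deliberately is to flag outliers and, if needed, cap them in a new column, leaving the original values untouched. This is a sketch of that option on made-up bills, not a recommendation for every dataset; the column names `is_outlier` and `bill_capped` are illustrative.

```python
import pandas as pd

# Made-up bills with one extreme value.
bills = pd.DataFrame({"total_bill": [12.0, 15.5, 18.0, 21.0, 24.5, 27.0, 95.0]})

q1 = bills["total_bill"].quantile(0.25)
q3 = bills["total_bill"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than delete: the row stays available for investigation.
bills["is_outlier"] = ~bills["total_bill"].between(low, high)

# Cap into a separate column so the raw value is never overwritten.
bills["bill_capped"] = bills["total_bill"].clip(lower=low, upper=high)

print(f"IQR bounds: {low:.2f} to {high:.2f}")  # IQR bounds: 3.25 to 39.25
print(bills)
```

Keeping the flag alongside the raw column documents the decision in the data itself.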

Key points

The IQR rule flags values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR.
A log transform (np.log1p) reduces right skew and often helps models.
Investigate outliers before removing — they may be the most interesting data you have.

6. Interactive charts with Plotly & telling the story

Static charts explain; interactive charts invite exploration. Plotly Express builds hoverable, zoomable charts in one line — perfect for notebooks, dashboards and stakeholder demos.

An interactive scatter in one line

import plotly.express as px

fig = px.scatter(tips, x='total_bill', y='tip',
                 color='time', size='size',
                 hover_data=['day'],
                 title='Tips explorer (hover to inspect)')
fig.show()                       # fig.write_html('tips.html') to share

Principles of an honest, clear chart

One message per chart. If you are explaining two things, make two charts.
Start bar-chart axes at zero. A truncated axis exaggerates differences and misleads.
Label directly where you can, and keep colour meaningful, not decorative.
Choose the right chart: distribution → histogram/box; relationship → scatter; trend over time → line; parts of a whole → bar (rarely pie).
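The zero-baseline rule is simple arithmetic. A bar's drawn height is its value minus the axis baseline, so raising the baseline inflates the visual ratio between bars. A minimal sketch with made-up mean bills (`apparent_ratio` is a hypothetical helper, not a library function):

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of drawn bar heights (b to a) when the y-axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


# Made-up mean bills for two days.
saturday, sunday = 20.0, 21.0

honest = apparent_ratio(saturday, sunday)                    # axis starts at 0
truncated = apparent_ratio(saturday, sunday, baseline=19.0)  # axis starts at 19

print(f"zero-based axis: Sunday's bar is {honest:.2f}x Saturday's")   # 1.05x
print(f"axis from 19:    Sunday's bar is {truncated:.2f}x Saturday's")  # 2.00x
```

A 5% difference in the data becomes a bar that looks twice as tall.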

Your question	Reach for
How is one variable distributed?	Histogram / boxplot
How do two numbers relate?	Scatter plot
How does a value change over time?	Line chart
How do categories compare?	Bar chart
How do many variables correlate?	Heatmap / pair plot

EDA is where insight is born. The best data scientists spend real time here. A model can only find what the data contains — and EDA is how you discover what that is, what is broken, and what is worth asking.

Key points

Plotly Express creates interactive (hover/zoom) charts in one line; export with write_html.
Honest charts: one message each, zero-based bar axes, meaningful colour, direct labels.
Match chart to question: distribution, relationship, trend, comparison, or correlation.

★ Hands-on Project — Full EDA on a Real Dataset

Choose a dataset and produce a complete exploratory analysis notebook that a stakeholder could read and trust.

Pick a dataset (e.g. seaborn's tips, titanic, or a Kaggle CSV) and load it into pandas.
Profile it: shape, info(), describe(), missing values, and value_counts() for each category.
Univariate: plot the distribution of every numeric column (histogram + KDE) and report skew; bar-chart the key categories.
Bivariate: build a correlation heatmap and at least two scatter plots of the strongest relationships, each with a trend line.
Detect outliers with the IQR rule on the main numeric column and decide (with justification) whether to keep, cap or investigate them.
Apply a log transform to one skewed column and show the before/after skew and histograms side by side.
Make one interactive Plotly chart and export it to HTML.
Write a short 'Five things I learned' markdown summary at the top of the notebook, then commit it to your portfolio.
