📐 Module 4

Statistics & Probability for Data Science

⏱ 16 hours · Intermediate · 6 topics

🎯 By the end: work with probability distributions in SciPy, apply the Central Limit Theorem, build confidence intervals, run and correctly interpret hypothesis tests and A/B tests, and reason about results without fooling yourself.

Machine learning will happily fit a model to noise and report a confident, wrong answer. Statistics is the discipline that stops that happening. It is how you tell a real signal from a fluke, put honest error bars on an estimate, and decide whether version B truly beats version A or just got lucky. This module is practical, not theorem-heavy: every idea comes with SciPy or statsmodels code and a plain-English interpretation. Get this right and you will trust your own conclusions — and so will everyone reading them.

1. Descriptive vs inferential statistics

Two jobs. Descriptive statistics summarise the data you have. Inferential statistics use a sample to draw conclusions about a larger population you cannot fully measure. Almost all data science is inference: we never have every customer, only some.

We measure a sample, then infer the population's parameters — with quantified uncertainty.

Centre and spread, precisely

import numpy as np

sample = np.array([12, 15, 14, 10, 18, 22, 16, 13, 19, 25])

print('Mean    :', sample.mean())
print('Median  :', np.median(sample))
print('Variance:', round(sample.var(ddof=1), 2))    # sample variance
print('Std dev :', round(sample.std(ddof=1), 2))

▶ Output

Mean    : 16.4
Median  : 15.5
Variance: 21.6
Std dev : 4.65

Use ddof=1 for a sample. Dividing by n−1 rather than n corrects the downward bias when estimating a population's variance from a sample. This is Bessel's correction. NumPy's var and std default to ddof=0 (pandas defaults to ddof=1), so pass ddof=1 explicitly when working with NumPy arrays.
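A quick simulation makes the bias visible. This is a minimal sketch on made-up data: it assumes a normal population with variance 4 and repeated samples of size 5, neither of which comes from the example above.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0                                   # assumed population variance (sd = 2)

# 20,000 independent samples, each of size 5, one sample per row
samples = rng.normal(loc=0.0, scale=2.0, size=(20_000, 5))

biased   = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()    # divide by n - 1

print(f'True variance       : {true_var:.2f}')
print(f'Average with ddof=0 : {biased:.2f}')
print(f'Average with ddof=1 : {unbiased:.2f}')
```

With n = 5 the ddof=0 average should sit near (n−1)/n × 4 = 3.2, while the ddof=1 average should sit near the true value of 4.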

Key points

Descriptive statistics summarise data you have; inferential statistics generalise from a sample to a population.
Mean and median capture centre; variance and standard deviation capture spread.
Use ddof=1 (n−1) for sample variance/standard deviation — Bessel's correction.

2. Probability distributions with SciPy

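A distribution describes how likely each outcome is. SciPy's stats module gives every common one a consistent toolkit: pdf (density, for continuous distributions) or pmf (probability mass, for discrete ones), cdf (probability up to and including a point) and ppf (the inverse of cdf, which returns percentiles).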

The normal distribution

from scipy import stats

# Daily visitors ~ Normal(mean=500, sd=80)
rv = stats.norm(loc=500, scale=80)

print('P(visitors < 600):', round(rv.cdf(600), 4))
print('95th percentile  :', round(rv.ppf(0.95), 1))
print('P(420 < X < 580) :', round(rv.cdf(580) - rv.cdf(420), 4))

▶ Output

P(visitors < 600): 0.8944
95th percentile  : 631.6
P(420 < X < 580) : 0.6827

That last line is the 68% rule in action: one standard deviation either side of the mean holds ~68.3% of the data.
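The same interval arithmetic extends to two and three standard deviations. As a cross-check that does not depend on SciPy, this sketch swaps in Python's standard-library statistics.NormalDist for stats.norm; the helper name within is made up for illustration.

```python
from statistics import NormalDist

visitors = NormalDist(mu=500, sigma=80)   # same model as the SciPy example

def within(k):
    """Probability of landing within k standard deviations of the mean."""
    lo = visitors.mean - k * visitors.stdev
    hi = visitors.mean + k * visitors.stdev
    return visitors.cdf(hi) - visitors.cdf(lo)

for k in (1, 2, 3):
    print(f'within {k} sd: {within(k):.4f}')

# inv_cdf plays the role of SciPy's ppf
p95 = visitors.inv_cdf(0.95)
print(f'95th percentile: {p95:.1f}')
```

The three probabilities are the familiar 68-95-99.7 rule, and the percentile matches the rv.ppf(0.95) result.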

The binomial distribution (counts of successes)

# Send 10 emails, each opens with probability 0.2
print('P(exactly 3 open):', round(stats.binom.pmf(3, n=10, p=0.2), 4))
print('P(3 or fewer)   :', round(stats.binom.cdf(3, n=10, p=0.2), 4))

▶ Output

P(exactly 3 open): 0.2013
P(3 or fewer)   : 0.8791
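It can help to see what binom.pmf and binom.cdf compute. The sketch below rebuilds both from the binomial formula using only the standard library; binom_pmf and binom_cdf are illustrative helper names, not SciPy functions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(k or fewer successes): sum the pmf from 0 to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

p_exact = binom_pmf(3, n=10, p=0.2)
p_up_to = binom_cdf(3, n=10, p=0.2)
print('P(exactly 3 open):', round(p_exact, 4))
print('P(3 or fewer)   :', round(p_up_to, 4))
```

Both values agree with the SciPy output: 0.2013 and 0.8791.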

Distribution	Models	SciPy
Normal	heights, errors, sums of many effects	`stats.norm`
Binomial	successes in n yes/no trials	`stats.binom`
Poisson	events per interval (arrivals)	`stats.poisson`
Uniform	equally likely outcomes	`stats.uniform`
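The Poisson and uniform rows use the same frozen-distribution pattern. A short sketch, assuming a hypothetical support desk that receives 4 tickets per hour on average and a wait time spread evenly between 0 and 10 minutes:

```python
from scipy import stats

# Poisson: tickets per hour, assumed average rate of 4
tickets = stats.poisson(mu=4)
p_exactly_2 = tickets.pmf(2)        # P(X = 2)
p_more_than_6 = tickets.sf(6)       # P(X > 6), the survival function 1 - cdf(6)

# Uniform on [0, 10]: SciPy takes loc = start and scale = width
wait = stats.uniform(loc=0, scale=10)
p_under_3 = wait.cdf(3)             # P(wait < 3 minutes)

print(f'P(exactly 2 tickets)  : {p_exactly_2:.4f}')
print(f'P(more than 6 tickets): {p_more_than_6:.4f}')
print(f'P(wait under 3 min)   : {p_under_3:.2f}')
```

Note that stats.uniform is parameterised by start and width, not by start and end, which is an easy thing to get wrong.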

Key points

SciPy gives every distribution pdf/pmf, cdf and ppf (percentile) methods.
Normal for continuous sums; binomial for counts of successes; Poisson for event rates.
cdf(b) - cdf(a) gives the probability that a value falls between a and b (for a discrete distribution this is P(a < X ≤ b)).

3. Sampling distributions & the Central Limit Theorem

The single most important idea in inference: even if your data is wildly non-normal, the distribution of the sample mean becomes approximately normal as the sample grows, provided the observations are independent and the population has finite variance. That is the Central Limit Theorem (CLT), and it is why so much of statistics works.

See it happen

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # very skewed

# Take 1000 samples of size 50 and record each mean
means = [rng.choice(population, 50).mean() for _ in range(1000)]

print('Population mean      :', round(population.mean(), 2))
print('Mean of sample means :', round(np.mean(means), 2))
print('Std of sample means  :', round(np.std(means), 2))
print('Predicted SE (σ/√n)  :', round(population.std() / np.sqrt(50), 2))

▶ Output

Population mean      : 9.98
Mean of sample means : 9.97
Std of sample means  : 1.40
Predicted SE (σ/√n)  : 1.41

However skewed the population, the means of repeated samples cluster tightly in an approximately normal curve.

The √n law. The standard error of the mean shrinks like 1/√n. To halve your uncertainty you need four times the data — a brutal but vital fact for planning experiments and reading error bars.
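The √n law is easy to check by simulation. This sketch draws directly from an exponential distribution with scale 10 (so σ = 10) instead of from a stored population array, and the sample sizes 25, 100 and 400 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 10.0          # sd of an exponential distribution with scale=10

se = {}
for n in (25, 100, 400):
    # 4,000 simulated samples of size n; one sample mean per row
    means = rng.exponential(scale=10, size=(4_000, n)).mean(axis=1)
    se[n] = means.std(ddof=1)
    print(f'n = {n:>3}: simulated SE = {se[n]:.3f}, predicted = {sigma / np.sqrt(n):.3f}')
```

Each fourfold increase in n should roughly halve the standard error: about 2.0, then 1.0, then 0.5.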

Key points

The Central Limit Theorem: for large enough samples, sample means are approximately normal whatever the population's shape (given finite variance).
The standard error of the mean is σ/√n — uncertainty shrinks with the square root of sample size.
Quadrupling the data only halves the error — plan sample sizes accordingly.

4. Confidence intervals & estimation

A point estimate (“mean tip = $3.00”) is incomplete without a measure of uncertainty. A confidence interval (CI) gives a plausible range for the true value.

Build a 95% CI for a mean

import numpy as np
from scipy import stats
import seaborn as sns

tips = sns.load_dataset('tips')
x = tips['tip']

mean = x.mean()
sem  = stats.sem(x)                     # standard error of the mean
ci   = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)

print(f'Mean tip : {mean:.2f}')
print(f'95% CI   : ({ci[0]:.2f}, {ci[1]:.2f})')

▶ Output

Mean tip : 3.00
95% CI   : (2.82, 3.17)

What a 95% CI really means. It is not “95% chance the true mean is in this interval.” It means: if we repeated the sampling many times, about 95% of the intervals we build would contain the true mean. The confidence is in the procedure, not in any one interval. Stating this correctly marks you as a serious analyst.
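The "repeated sampling" reading can be simulated directly. The sketch below assumes a made-up normal population (mean 50, sd 10) where the true mean is known, runs 2,000 imaginary studies of 30 observations each, and counts how often the 95% t-interval captures the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, n, reps = 50.0, 30, 2_000     # assumed population and study size

# One row per repeated study
samples = rng.normal(loc=true_mean, scale=10.0, size=(reps, n))
means = samples.mean(axis=1)
sems  = samples.std(axis=1, ddof=1) / np.sqrt(n)

t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value
lower, upper = means - t_crit * sems, means + t_crit * sems

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f'Intervals containing the true mean: {coverage:.1%}')
```

The coverage should come out close to 95%. Any single interval either contains 50 or it does not; the 95% describes the batch.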

Bootstrap: a CI with no formula

# Resample with replacement 10,000 times, recompute the mean each time
rng = np.random.default_rng(1)
boot = [rng.choice(x, len(x), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f'Bootstrap 95% CI: ({lo:.2f}, {hi:.2f})')

▶ Output

Bootstrap 95% CI: (2.83, 3.18)

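The bootstrap reaches almost the same interval without assuming any particular distribution, just by resampling. It still relies on the sample being representative of the population and on the observations being independent. It is the analyst's Swiss-army knife when formulas get hard.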
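The bootstrap earns its keep on statistics that have no tidy formula, such as the median. A minimal sketch on made-up skewed data (the lognormal "order values" are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
order_values = rng.lognormal(mean=3.0, sigma=0.8, size=200)   # made-up skewed data

sample_median = np.median(order_values)

# Resample with replacement and recompute the median each time
boot_medians = [
    np.median(rng.choice(order_values, size=len(order_values), replace=True))
    for _ in range(5_000)
]
lo, hi = np.percentile(boot_medians, [2.5, 97.5])

print(f'Sample median   : {sample_median:.2f}')
print(f'Bootstrap 95% CI: ({lo:.2f}, {hi:.2f})')
```

Swapping np.median for any other statistic (a trimmed mean, a ratio, a percentile) needs no new mathematics, only a different function call.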

Key points

A confidence interval reports a plausible range for a parameter, not just a point estimate.
stats.t.interval builds a CI for a mean; the bootstrap builds one by resampling, without assuming a particular distribution.
A 95% CI describes the long-run reliability of the procedure, not the probability for one interval.

5. Hypothesis testing & p-values

A hypothesis test asks: could this difference be just noise? You assume “no real effect” (the null hypothesis), then compute how surprising your data would be if that were true. That surprise is the p-value.
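One way to see what a p-value measures is a permutation test: if the null is true, group labels are arbitrary, so shuffling them shows what differences pure chance produces. This is a sketch on two small made-up groups, not a replacement for the t-test used in this topic.

```python
import numpy as np

rng = np.random.default_rng(11)
group_a = np.array([12, 15, 14, 10, 18, 22, 16, 13, 19, 25])   # made-up data
group_b = np.array([21, 24, 19, 27, 23, 30, 22, 26, 20, 28])   # made-up data

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)             # reassign labels at random
    diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
    if abs(diff) >= abs(observed):                 # two-sided: at least as extreme
        extreme += 1

p_value = (extreme + 1) / (n_perm + 1)             # +1 keeps the estimate above zero
print(f'Observed difference: {observed:.1f}')
print(f'Permutation p-value: {p_value:.4f}')
```

The p-value is simply the share of shuffles that produce a difference at least as large as the one observed.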

A two-sample t-test

from scipy import stats

# tips is the DataFrame loaded in topic 4 with sns.load_dataset('tips')
dinner = tips[tips['time'] == 'Dinner']['total_bill']
lunch  = tips[tips['time'] == 'Lunch']['total_bill']

t, p = stats.ttest_ind(dinner, lunch, equal_var=False)
print(f'Dinner mean: {dinner.mean():.2f}')
print(f'Lunch mean : {lunch.mean():.2f}')
print(f't = {t:.3f}, p = {p:.4f}')

▶ Output

Dinner mean: 20.80
Lunch mean : 17.17
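t = 3.123, p = 0.0022

p ≈ 0.002 is well below 0.05, so we reject the null: dinner bills are significantly higher than lunch bills, by about $3.63 on average. Passing equal_var=False runs Welch's t-test, which does not assume the two groups share the same variance.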

Figure: the null distribution with its tails shaded. If the test statistic lands in a shaded tail (a rare region under the null), we reject the null.

What a p-value is not. It is not the probability the null is true, and it says nothing about effect size. A tiny, useless difference can be “significant” with enough data. Always report the effect size and a confidence interval alongside the p-value, and fix your threshold (e.g. 0.05) before looking.
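The "significant but useless" case is easy to reproduce. This sketch simulates two huge groups whose true means differ by only 0.1 on a scale with standard deviation 10; every number here is an assumption chosen to make the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 1_000_000                                   # per group; deliberately huge
a = rng.normal(loc=100.0, scale=10.0, size=n)   # simulated control
b = rng.normal(loc=100.1, scale=10.0, size=n)   # simulated variant, true shift = 0.1

t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)

# Cohen's d: difference in means divided by the pooled standard deviation
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f'p-value   : {p_value:.2e}')
print(f"Cohen's d : {cohens_d:.3f}")
```

The p-value should be tiny while Cohen's d stays near 0.01, far below the 0.2 usually called a "small" effect. Significance tells you the difference is probably not zero, not that it matters.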

Key points

A hypothesis test measures how surprising the data is if the null (no effect) were true.
A small p-value (< your pre-set threshold) leads you to reject the null hypothesis.
A p-value is not the probability the null is true and ignores effect size — always report both.

6. A/B testing & a glimpse of Bayes

A/B testing is hypothesis testing applied to product decisions: show variant A to one group, B to another, and test whether the difference in outcomes is real.

Compare two conversion rates

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# A: 120 / 1000 converted   B: 150 / 1000 converted
conversions = np.array([120, 150])
visitors    = np.array([1000, 1000])

z, p = proportions_ztest(conversions, visitors)
print(f'A rate: {120/1000:.1%}   B rate: {150/1000:.1%}')
print(f'z = {z:.3f}, p = {p:.4f}')

▶ Output

A rate: 12.0%   B rate: 15.0%
z = -1.963, p = 0.0496

p = 0.0496 is just under 0.05: technically significant, but barely. (z is negative only because A is listed first and has the lower rate.) A careful analyst would want a larger sample before betting the roadmap on a 3-percentage-point lift this marginal.
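A confidence interval for the lift says more than the p-value alone. The sketch below computes a standard Wald interval for the difference in rates by hand, using only the standard library; it is one common choice of interval, not the only one.

```python
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 120, 1000
conv_b, n_b = 150, 1000
p_a, p_b = conv_a / n_a, conv_b / n_b

diff = p_b - p_a                       # absolute lift
rel_lift = diff / p_a                  # relative lift

# Unpooled standard error of the difference (Wald interval)
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96
lo, hi = diff - z_crit * se, diff + z_crit * se

print(f'Absolute lift: {diff:.1%}  (relative: {rel_lift:.0%})')
print(f'95% CI for the difference: ({lo:.2%}, {hi:.2%})')
```

The interval runs from barely above zero to roughly six percentage points, which matches the marginal p-value: the data are compatible with almost no lift and with a large one.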

Designing a trustworthy test

Power & sample size: decide the smallest lift worth detecting and compute the sample size before you start.
Do not peek: repeatedly checking and stopping when p < 0.05 inflates false positives. Fix the duration up front.
Randomise properly and check the groups are balanced.
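One metric: test twenty metrics at the 0.05 level and at least one “significant” fluke becomes more likely than not (the multiple-comparisons problem). Pick a primary metric up front, or correct for the number of comparisons.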
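Two of these rules reduce to short calculations. The sketch below uses the standard normal-approximation formula for comparing two proportions, with an assumed two-sided α of 0.05 and 80% power; n_per_arm is an illustrative helper, and statsmodels offers its own power tools if you prefer a library routine.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate users per arm to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

needed = n_per_arm(0.12, 0.15)
print('Users needed per arm:', needed)

# Chance of at least one false positive across 20 independent metrics at alpha = 0.05
fwer = 1 - (1 - 0.05) ** 20
print(f'P(at least one fluke in 20 metrics): {fwer:.0%}')
```

Detecting a lift from 12% to 15% with 80% power takes roughly 2,000 users per arm, about double the 1,000 used in the example, which helps explain why that result was so marginal.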

The Bayesian alternative

Frequentist tests ask “how surprising is the data under the null?” Bayesian methods ask the question businesses actually want: “given the data, what is the probability B is better than A?” You start with a prior belief and update it with evidence to get a posterior. Both views are valuable; the Bayesian framing is often easier to act on.
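For conversion rates the Bayesian update has a simple form. This sketch assumes a uniform Beta(1, 1) prior on each rate, which is a modelling choice and not something the A/B example prescribes, and estimates P(B better than A) by sampling from the two posteriors.

```python
import numpy as np

rng = np.random.default_rng(2)
draws = 200_000

# Posterior for each rate: Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + 120, 1 + 880, size=draws)   # A: 120 of 1000 converted
post_b = rng.beta(1 + 150, 1 + 850, size=draws)   # B: 150 of 1000 converted

prob_b_better = np.mean(post_b > post_a)
print(f'P(B better than A | data) = {prob_b_better:.3f}')
```

The result should land around 0.97 to 0.98. A different prior would shift it, so state the prior whenever you report a number like this.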

Statistics is the immune system of data science. Models find patterns; statistics tells you which patterns to trust. The analysts who get promoted are the ones who can say, with evidence, “this result is real and this much” — or honestly, “we do not have enough data yet.”

Key points

A/B testing applies hypothesis testing to decisions; use a proportions z-test for conversion rates.
Pre-compute sample size, do not peek-and-stop, randomise, and avoid testing many metrics at once.
Bayesian methods answer 'P(B better than A | data)' directly by updating a prior with evidence.

★ Hands-on Project — Run and Report an A/B Test

Simulate or use real experiment data and produce a rigorous, honest A/B-test analysis a product manager could act on.

Generate or load two groups, control and variant, with at least 1,000 users per arm and a binary outcome (converted / not) for each user.
Report descriptive stats: conversion rate and a 95% confidence interval for each group.
State your null and alternative hypotheses and your significance threshold before testing.
Run a proportions z-test (or a t-test if the outcome is continuous) and report the test statistic and p-value.
Compute the effect size (absolute and relative lift) and a confidence interval for the difference — not just the p-value.
Use a bootstrap to produce a CI for the difference in rates and confirm it agrees with the formula.
Write a clear recommendation: ship, do not ship, or collect more data — and justify it with effect size + uncertainty, not p-value alone.
Add a short 'threats to validity' section (peeking, sample ratio mismatch, multiple metrics) and commit to your portfolio.
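As a possible starting point for the simulation route, the sketch below generates two arms and bootstraps a CI for the difference in rates. The arm sizes and true conversion rates are assumptions chosen for illustration; the hypothesis test, effect-size write-up and recommendation are left to you.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_control, n_variant = 5_000, 5_000          # users per arm (assumed)
true_rates = (0.10, 0.12)                    # assumed; unknown in a real test

# 1 = converted, 0 = did not convert
control = rng.binomial(1, true_rates[0], size=n_control)
variant = rng.binomial(1, true_rates[1], size=n_variant)

observed_diff = variant.mean() - control.mean()

# Bootstrap the difference: resample each arm with replacement
boot_diffs = np.array([
    rng.choice(variant, n_variant).mean() - rng.choice(control, n_control).mean()
    for _ in range(2_000)
])
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])

print(f'Observed lift   : {observed_diff:.2%}')
print(f'Bootstrap 95% CI: ({lo:.2%}, {hi:.2%})')
```

Because the true rates are known here, you can rerun with different seeds and sample sizes to see how often the analysis reaches the right conclusion.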

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
