Machine learning will happily fit a model to noise and report a confident, wrong answer. Statistics is the discipline that stops that happening. It is how you tell a real signal from a fluke, put honest error bars on an estimate, and decide whether version B truly beats version A or just got lucky. This module is practical, not theorem-heavy: every idea comes with SciPy or statsmodels code and a plain-English interpretation. Get this right and you will trust your own conclusions — and so will everyone reading them.
1. Descriptive vs inferential statistics
Two jobs. Descriptive statistics summarise the data you have. Inferential statistics use a sample to draw conclusions about a larger population you cannot fully measure. Almost all data science is inference: we never have every customer, only some.
Centre and spread, precisely
import numpy as np
sample = np.array([12, 15, 14, 10, 18, 22, 16, 13, 19, 25])
print('Mean :', sample.mean())
print('Median :', np.median(sample))
print('Variance:', round(sample.var(ddof=1), 2)) # sample variance
print('Std dev :', round(sample.std(ddof=1), 2))
Output:
Mean : 16.4
Median : 15.5
Variance: 21.6
Std dev : 4.65
ddof=1 for a sample. Dividing by n−1 (not n) corrects the bias when estimating a population's variance from a sample. This is Bessel's correction, and it is what you want for real data. NumPy defaults to ddof=0, so pass ddof=1 explicitly.
- Descriptive statistics summarise data you have; inferential statistics generalise from a sample to a population.
- Mean and median capture centre; variance and standard deviation capture spread.
- Use ddof=1 (n−1) for sample variance/standard deviation: Bessel's correction.
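To see what Bessel's correction actually changes, the short sketch below computes the variance of the same ten values both ways. It is only an illustration; the variable names are made up for this example. The point is that the two results differ by exactly the factor n/(n−1), so the gap matters most for small samples.

```python
import numpy as np

sample = np.array([12, 15, 14, 10, 18, 22, 16, 13, 19, 25])
n = sample.size

var_divide_by_n = sample.var(ddof=0)          # divides by n (NumPy's default)
var_divide_by_n_minus_1 = sample.var(ddof=1)  # divides by n - 1 (Bessel's correction)

# The corrected estimate is larger by exactly n / (n - 1).
ratio = var_divide_by_n_minus_1 / var_divide_by_n

print('ddof=0:', round(var_divide_by_n, 2))
print('ddof=1:', round(var_divide_by_n_minus_1, 2))
print('ratio :', round(ratio, 4), 'vs n/(n-1) =', round(n / (n - 1), 4))
```

With ten observations the correction is about 11%; with a thousand it is about 0.1%.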
2. Probability distributions with SciPy
A distribution describes how likely each outcome is. SciPy's stats module gives every common one a consistent toolkit: pdf/pmf (density), cdf (probability up to a point) and ppf (the inverse — percentiles).
The normal distribution
from scipy import stats
# Daily visitors ~ Normal(mean=500, sd=80)
rv = stats.norm(loc=500, scale=80)
print('P(visitors < 600):', round(rv.cdf(600), 4))
print('95th percentile :', round(rv.ppf(0.95), 1))
print('P(420 < X < 580) :', round(rv.cdf(580) - rv.cdf(420), 4))
Output:
P(visitors < 600): 0.8944
95th percentile : 631.6
P(420 < X < 580) : 0.6827
That last line is the 68% rule in action: one standard deviation either side of the mean holds ~68.3% of the data.
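The same cdf subtraction extends the 68% rule to two and three standard deviations. The sketch below uses the standard normal (mean 0, sd 1), so each `k` is a number of standard deviations; the dictionary name `coverage` is just a label chosen for this example.

```python
from scipy import stats

z = stats.norm()  # standard normal: loc=0, scale=1

# Probability of landing within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f'within {k} sd: {prob:.4f}')
```

These are the familiar 68-95-99.7 figures, and they hold for any normal distribution, whatever its mean and scale.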
The binomial distribution (counts of successes)
# Send 10 emails, each opens with probability 0.2
print('P(exactly 3 open):', round(stats.binom.pmf(3, n=10, p=0.2), 4))
print('P(3 or fewer) :', round(stats.binom.cdf(3, n=10, p=0.2), 4))
Output:
P(exactly 3 open): 0.2013
P(3 or fewer) : 0.8791
| Distribution | Models | SciPy |
|---|---|---|
| Normal | heights, errors, sums of many effects | stats.norm |
| Binomial | successes in n yes/no trials | stats.binom |
| Poisson | events per interval (arrivals) | stats.poisson |
| Uniform | equally likely outcomes | stats.uniform |
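The table lists Poisson and uniform, which the examples so far have not touched. Here is a minimal sketch of both using the same pmf/cdf toolkit. The scenarios (tickets per hour, a 30-minute bus window) and all the numbers are invented for illustration.

```python
from scipy import stats

# Hypothetical: support tickets arrive at an average rate of 4 per hour
arrivals = stats.poisson(mu=4)
p_exactly_2 = arrivals.pmf(2)     # P(X = 2)
p_at_most_2 = arrivals.cdf(2)     # P(X <= 2)
p_more_than_6 = arrivals.sf(6)    # survival function: P(X > 6) = 1 - cdf(6)

# Hypothetical: a bus arrives uniformly at random in the next 30 minutes
wait = stats.uniform(loc=0, scale=30)   # support is [loc, loc + scale]
p_wait_under_10 = wait.cdf(10)          # P(wait < 10 minutes)

print('P(exactly 2 tickets):', round(p_exactly_2, 4))
print('P(2 or fewer)       :', round(p_at_most_2, 4))
print('P(more than 6)      :', round(p_more_than_6, 4))
print('P(wait < 10 min)    :', round(p_wait_under_10, 4))
```

Note that `stats.uniform` takes `loc` and `scale` (the start and the width), not the two endpoints.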
- SciPy gives every distribution pdf/pmf, cdf and ppf (percentile) methods.
- Normal for continuous sums; binomial for counts of successes; Poisson for event rates.
- cdf(b) - cdf(a) gives the probability a value falls in the interval (a, b).
3. Sampling distributions & the Central Limit Theorem
The single most important idea in inference: even if your data is wildly non-normal, the distribution of the sample mean becomes approximately normal as the sample grows (provided the population has finite variance). That is the Central Limit Theorem (CLT), and it is why so much of statistics works.
See it happen
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000) # very skewed
# Take 1000 samples of size 50 and record each mean
means = [rng.choice(population, 50).mean() for _ in range(1000)]
print('Population mean :', round(population.mean(), 2))
print('Mean of sample means :', round(np.mean(means), 2))
print('Std of sample means :', round(np.std(means), 2))
print('Predicted SE (s/√n):', round(population.std() / np.sqrt(50), 2))
Output:
Population mean : 9.98
Mean of sample means : 9.97
Std of sample means : 1.40
Predicted SE (s/√n): 1.41
√n law. The standard error of the mean shrinks like 1/√n. To halve your uncertainty you need four times the data: a brutal but vital fact for planning experiments and reading error bars.
- The Central Limit Theorem: sample means are approximately normal regardless of the population's shape.
- The standard error of the mean is σ/√n: uncertainty shrinks with the square root of sample size.
- Quadrupling the data only halves the error, so plan sample sizes accordingly.
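The √n law is easy to check with plain arithmetic. The sketch below assumes a population standard deviation of 10 (an arbitrary illustrative value) and a helper called `standard_error`, which is a name made up here rather than a library function.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean for population sd `sigma` and sample size `n`."""
    return sigma / sqrt(n)

sigma = 10.0  # assumed population standard deviation (illustrative)
for n in (50, 200, 800):
    print(f'n = {n:>3}: SE = {standard_error(sigma, n):.3f}')
```

Each fourfold jump in n (50 to 200 to 800) cuts the standard error in half, which is why tight error bars get expensive quickly.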
4. Confidence intervals & estimation
A point estimate (“mean tip = $3.00”) is incomplete without a measure of uncertainty. A confidence interval (CI) gives a plausible range for the true value.
Build a 95% CI for a mean
import numpy as np
from scipy import stats
import seaborn as sns
tips = sns.load_dataset('tips')
x = tips['tip']
mean = x.mean()
sem = stats.sem(x) # standard error of the mean
ci = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)
print(f'Mean tip : {mean:.2f}')
print(f'95% CI : ({ci[0]:.2f}, {ci[1]:.2f})')
Output:
Mean tip : 3.00
95% CI : (2.82, 3.17)
Bootstrap: a CI with no formula
# Resample with replacement 10,000 times, recompute the mean each time
rng = np.random.default_rng(1)
boot = [rng.choice(x, len(x), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f'Bootstrap 95% CI: ({lo:.2f}, {hi:.2f})')
Output:
Bootstrap 95% CI: (2.83, 3.18)
The bootstrap reaches almost the same interval without assuming any particular distribution for the data, just by resampling. It still relies on the sample being representative of the population. It is the analyst's Swiss-army knife when formulas get hard.
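A case where the formula really does get hard is the median, which has no simple standard-error formula like the mean does. The sketch below applies the same percentile bootstrap to a median. The data is synthetic and right-skewed (drawn from a lognormal purely as a stand-in, not the tips dataset), and the resample count of 5,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-in data (hypothetical values)
values = rng.lognormal(mean=1.0, sigma=0.5, size=300)

# Draw all resamples at once: each row holds len(values) indices
# sampled with replacement from the original data.
idx = rng.integers(0, len(values), size=(5_000, len(values)))
boot_medians = np.median(values[idx], axis=1)

lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f'Sample median   : {np.median(values):.2f}')
print(f'Bootstrap 95% CI: ({lo:.2f}, {hi:.2f})')
```

Swapping `np.median` for any other statistic (a trimmed mean, a ratio, a percentile) gives an interval for that statistic with no new maths.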
- A confidence interval reports a plausible range for a parameter, not just a point estimate.
- stats.t.interval builds a CI for a mean; the bootstrap builds one by resampling, with no distributional formula needed.
- A 95% CI describes the long-run reliability of the procedure, not the probability for one interval.
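"Long-run reliability" can be made concrete with a simulation: draw many samples from a population whose mean is known, build a 95% t-interval from each, and count how often the interval contains the truth. The population parameters, sample size and number of trials below are all assumed values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, true_sd = 3.0, 1.4   # assumed "population" (illustrative)
n, trials = 50, 2_000

# One row per simulated experiment
samples = rng.normal(true_mean, true_sd, size=(trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Two-sided 95% critical value from the t distribution with n - 1 df
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f'Intervals containing the true mean: {coverage:.1%}')
```

The share should land close to 95%. Any single interval either contains the true mean or it does not; the 95% describes how often the recipe succeeds.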
5. Hypothesis testing & p-values
A hypothesis test asks: could this difference be just noise? You assume “no real effect” (the null hypothesis), then compute how surprising your data would be if that were true. That surprise is the p-value.
A two-sample t-test
from scipy import stats
dinner = tips[tips['time'] == 'Dinner']['total_bill']
lunch = tips[tips['time'] == 'Lunch']['total_bill']
t, p = stats.ttest_ind(dinner, lunch, equal_var=False)
print(f'Dinner mean: {dinner.mean():.2f}')
print(f'Lunch mean : {lunch.mean():.2f}')
print(f't = {t:.3f}, p = {p:.4f}')
Output:
Dinner mean: 20.80
Lunch mean : 17.17
t = 3.123, p = 0.0022
With p ≈ 0.002, well below 0.05, we reject the null: the gap in mean bills is statistically significant, with dinner bills higher on average.
- A hypothesis test measures how surprising the data is if the null (no effect) were true.
- A small p-value (< your pre-set threshold) leads you to reject the null hypothesis.
- A p-value is not the probability the null is true and ignores effect size — always report both.
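Since a p-value says nothing about how big a difference is, it helps to report an effect size next to it. One common choice is Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below is one way to compute it; `cohens_d` is a helper written for this example (SciPy's t-test does not return it), and the two groups are synthetic numbers, not the tips data.

```python
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
# Synthetic stand-ins for two groups (hypothetical values)
group_a = rng.normal(20.8, 9.0, size=176)
group_b = rng.normal(17.2, 8.0, size=68)

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

A rough convention treats d near 0.2 as small, 0.5 as medium and 0.8 as large, but what counts as meaningful depends on the domain.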
6. A/B testing & a glimpse of Bayes
A/B testing is hypothesis testing applied to product decisions: show variant A to one group, B to another, and test whether the difference in outcomes is real.
Compare two conversion rates
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
# A: 120 / 1000 converted B: 150 / 1000 converted
conversions = np.array([120, 150])
visitors = np.array([1000, 1000])
z, p = proportions_ztest(conversions, visitors)
print(f'A rate: {120/1000:.1%} B rate: {150/1000:.1%}')
print(f'z = {z:.3f}, p = {p:.4f}')
Output:
A rate: 12.0% B rate: 15.0%
z = -1.963, p = 0.0496
A p-value of 0.0496 is just under 0.05: technically significant, but barely. A careful analyst would want a larger sample before betting the roadmap on a 3-point lift this marginal.
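Under the hood, the two-proportion z-test is a short formula: pool the two groups to estimate a common rate, use it to get the standard error of the difference, and compare the observed difference with that standard error. The sketch below spells this out using only the standard library. `two_proportion_ztest` is a name invented here, not a statsmodels function, and it assumes a two-sided test with a pooled variance estimate.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test for equal proportions, using the pooled rate."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                # common rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # both tails
    return z, p_value

z, p = two_proportion_ztest(120, 1000, 150, 1000)
print(f'z = {z:.3f}, p = {p:.4f}')
```

The z statistic should agree with the statsmodels call above up to rounding, which is a useful sanity check that you know what the library is doing.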
Designing a trustworthy test
- Power & sample size: decide the smallest lift worth detecting and compute the sample size before you start.
- Do not peek: repeatedly checking and stopping when p < 0.05 inflates false positives. Fix the duration up front.
- Randomise properly and check the groups are balanced.
- One metric: testing twenty metrics at α = 0.05 makes at least one “significant” fluke likely, roughly a 64% chance if the metrics are independent (multiple-comparisons problem).
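Two items on this checklist come down to arithmetic you can do before the experiment starts. The sketch below uses a standard normal-approximation formula for the per-arm sample size of a two-proportion test, plus the family-wise error rate for several independent tests. Both function names are made up for this example, the 12% and 15% rates are reused from the conversion example above, and the 5% significance level and 80% power are conventional defaults, not requirements.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect p_base vs p_variant (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p_base - p_variant) ** 2)

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_needed = sample_size_per_arm(0.12, 0.15)
fwer_20 = familywise_error_rate(20)

print(f'Users per arm to detect 12% -> 15%: {n_needed}')
print(f'P(at least one fluke in 20 tests) : {fwer_20:.1%}')
```

Under these assumptions the test needs roughly two thousand users per arm, about double the 1,000 used in the example above, which is consistent with that result being so marginal.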
The Bayesian alternative
Frequentist tests ask “how surprising is the data under the null?” Bayesian methods ask the question businesses actually want: “given the data, what is the probability B is better than A?” You start with a prior belief and update it with evidence to get a posterior. Both views are valuable; the Bayesian framing is often easier to act on.
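For conversion rates, the prior-to-posterior update has a simple closed form, the Beta-Binomial model. The sketch below is one minimal way to estimate the probability that B beats A for the 120/1000 vs 150/1000 example. It assumes a uniform Beta(1, 1) prior on each rate, which is a modelling choice rather than the only option, and it estimates the probability by drawing from each posterior instead of solving an integral.

```python
import numpy as np

rng = np.random.default_rng(11)
draws = 200_000

# Posterior for a rate with a Beta(1, 1) prior:
# Beta(1 + conversions, 1 + non-conversions)
post_a = rng.beta(1 + 120, 1 + 1000 - 120, size=draws)
post_b = rng.beta(1 + 150, 1 + 1000 - 150, size=draws)

# Share of posterior draws in which B's rate exceeds A's
prob_b_better = np.mean(post_b > post_a)
print(f'P(B better than A | data) = {prob_b_better:.3f}')
```

The result is a direct statement about B versus A, which is often easier to communicate than a p-value, though it does depend on the prior you chose.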
- A/B testing applies hypothesis testing to decisions; use a proportions z-test for conversion rates.
- Pre-compute sample size, do not peek-and-stop, randomise, and avoid testing many metrics at once.
- Bayesian methods answer 'P(B better than A | data)' directly by updating a prior with evidence.
★ Hands-on Project — Run and Report an A/B Test
Simulate or use real experiment data and produce a rigorous, honest A/B-test analysis a product manager could act on.
- Generate or load two groups: control and variant, each with a binary outcome (converted / not) of at least 1,000 users per arm.
- Report descriptive stats: conversion rate and a 95% confidence interval for each group.
- State your null and alternative hypotheses and your significance threshold before testing.
- Run a proportions z-test (or a t-test if the outcome is continuous) and report the test statistic and p-value.
- Compute the effect size (absolute and relative lift) and a confidence interval for the difference — not just the p-value.
- Use a bootstrap to produce a CI for the difference in rates and confirm it agrees with the formula.
- Write a clear recommendation: ship, do not ship, or collect more data — and justify it with effect size + uncertainty, not p-value alone.
- Add a short 'threats to validity' section (peeking, sample ratio mismatch, multiple metrics) and commit to your portfolio.
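As a starting point for the effect-size and bootstrap steps, here is a rough sketch on simulated data. Everything in it is an assumption made for illustration: the true rates (12% and 15%), the 5,000 users per arm, the unpooled (Wald) formula for the interval, and the 2,000 bootstrap resamples. Treat it as scaffolding to adapt, not a finished analysis.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)

# Simulated experiment: 1 = converted, 0 = did not (hypothetical true rates)
control = rng.binomial(1, 0.12, size=5_000)
variant = rng.binomial(1, 0.15, size=5_000)

p_c, p_v = control.mean(), variant.mean()
diff = p_v - p_c            # absolute lift
rel_lift = diff / p_c       # relative lift

# Formula-based (Wald) 95% CI for the difference, unpooled standard error
z_crit = NormalDist().inv_cdf(0.975)
se = np.sqrt(p_c * (1 - p_c) / control.size + p_v * (1 - p_v) / variant.size)
wald_ci = (diff - z_crit * se, diff + z_crit * se)

# Bootstrap 95% CI: resample each arm with replacement, recompute the difference
boot = np.array([
    rng.choice(variant, variant.size).mean() - rng.choice(control, control.size).mean()
    for _ in range(2_000)
])
boot_ci = tuple(np.percentile(boot, [2.5, 97.5]))

print(f'Absolute lift: {diff:.3%}   Relative lift: {rel_lift:.1%}')
print(f'Wald 95% CI     : ({wald_ci[0]:.4f}, {wald_ci[1]:.4f})')
print(f'Bootstrap 95% CI: ({boot_ci[0]:.4f}, {boot_ci[1]:.4f})')
```

If the two intervals disagree noticeably, that is worth investigating before writing the recommendation.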