Data science sits on three legs: programming, mathematics and domain judgement. Most courses rush to fancy models and leave the maths as a mystery, so students can call model.fit() but cannot say why it works or fix it when it breaks. This module deliberately builds the foundation first. You will set up a professional, reproducible workspace, learn NumPy, the array library that much of the scientific-Python stack is built on, and develop a working, code-first intuition for the linear algebra, calculus and probability that power the models you will train later. Light on proofs, heavy on running code.
1. The data-science lifecycle & a reproducible environment
A data scientist does not just train models — they run a project from a fuzzy business question all the way to a deployed, monitored system. The industry-standard map for this is CRISP-DM, and every module of this course slots into one of its phases.
Set up your environment (pick one path)
- Zero-install (start here): Google Colab — full Python with NumPy, pandas and GPUs in your browser. Visit colab.research.google.com.
- On your machine: install Anaconda (or lightweight Miniconda), which bundles Python 3, Jupyter and the scientific stack.
- Editor: VS Code with the Python and Jupyter extensions for larger projects.
Isolate every project (this matters)
A reproducible project pins its exact libraries so it runs the same on any machine, next year, for anyone. Create a fresh, isolated environment per project:
# with conda
conda create -n ds-course python=3.11 numpy pandas scikit-learn jupyter
conda activate ds-course
# or with plain Python
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install numpy pandas scikit-learn jupyter
# freeze exact versions so others reproduce your results
pip freeze > requirements.txt
Your first scientific-Python cell
import sys
import numpy as np
print('Python :', sys.version.split()[0])
print('NumPy :', np.__version__)
print('Ready for data science!')

Output:
Python : 3.11.7
NumPy : 1.26.4
Ready for data science!
Commit requirements.txt to Git and, before sharing a notebook, restart the kernel and “Run all”. If it runs top-to-bottom with no errors, it is reproducible: the single most-respected discipline in professional data science.
- The CRISP-DM lifecycle (Business → Data → Prepare → Model → Evaluate → Deploy) is a cycle, not a straight line.
- Use Colab to start instantly; use conda/venv to isolate each project and pin versions with requirements.txt.
- Restart-and-run-all is the test of a reproducible notebook. Make it a habit from day one.
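As a quick check on what an environment actually contains, the sketch below prints installed versions in the same name==version form that pip freeze writes. The helper name pinned_versions and the package list are illustrative assumptions, not part of the course materials; it relies only on the standard-library importlib.metadata module.

```python
from importlib import metadata


def pinned_versions(packages):
    """Return 'name==version' lines for the given packages (hypothetical helper)."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages instead of crashing
            lines.append(f"# {name} is not installed in this environment")
    return lines


# Illustrative package list; swap in whatever your project imports.
for line in pinned_versions(["numpy", "pandas", "scikit-learn"]):
    print(line)
```

Running this in two different environments and comparing the printed lines is a cheap way to spot version drift before it causes a confusing bug.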
2. NumPy: arrays, vectorisation & broadcasting
Core data-science libraries such as pandas and scikit-learn are built on NumPy, and PyTorch's tensors follow the same array model. NumPy's core object is the ndarray: a grid of numbers that you operate on all at once, with no Python loop. This is called vectorisation, and it is both faster to write and dramatically faster to run.
Lists vs NumPy arrays
| | Python list | NumPy array |
|---|---|---|
| Maths on the whole thing | needs a loop | one expression |
| Speed on big data | slow | ~10–100× faster |
| Memory | heavy | compact (fixed type) |
| Multi-dimensional | awkward | native (matrices, tensors) |
Vectorised arithmetic — no loops
import numpy as np
prices = np.array([250.0, 99.5, 430.0, 75.0, 999.0])
qty = np.array([4, 10, 2, 8, 1])
revenue = prices * qty # element-wise, no loop
print('Revenue per line:', revenue)
print('Total revenue :', revenue.sum())
print('Mean price :', prices.mean())

Output:
Revenue per line: [1000.  995.  860.  600.  999.]
Total revenue : 4454.0
Mean price : 370.7
One line, prices * qty, multiplied five pairs of numbers. On a million rows the code looks identical — and runs in milliseconds.
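To see the speed difference for yourself, here is a rough timing sketch on a million simulated rows. The data is randomly generated for illustration, and the timings depend on your machine, so treat the printed numbers as indicative rather than as a benchmark.

```python
import time
import numpy as np

rng = np.random.default_rng(0)          # seeded so the data is repeatable
prices = rng.uniform(50, 1000, size=1_000_000)
qty = rng.integers(1, 11, size=1_000_000)

# Pure-Python loop: one multiplication per iteration
start = time.perf_counter()
loop_total = 0.0
for p, q in zip(prices.tolist(), qty.tolist()):
    loop_total += p * q
loop_seconds = time.perf_counter() - start

# Vectorised: the same arithmetic in a single expression
start = time.perf_counter()
vec_total = (prices * qty).sum()
vec_seconds = time.perf_counter() - start

print(f'loop      : {loop_seconds:.4f} s')
print(f'vectorised: {vec_seconds:.4f} s')
print('same answer:', np.isclose(loop_total, vec_total))
```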
Shapes, axes & broadcasting
Arrays have a shape (rows, columns…). Broadcasting lets NumPy stretch a smaller array to fit a bigger one, so you can, say, add a single number to every cell, or subtract a per-column mean.
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
print(matrix.shape) # (rows, columns)
print(matrix + 10) # broadcast a scalar to every cell
print(matrix.mean(axis=0)) # mean down each column
print(matrix.sum(axis=1)) # sum across each row

Output:
(2, 3)
[[11 12 13]
 [14 15 16]]
[2.5 3.5 4.5]
[ 6 15]
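Subtracting a per-column mean, mentioned above as a use of broadcasting, can be sketched as follows. The per-row variant is an added illustration; it uses keepdims=True so the row means keep a shape that lines up with the matrix.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

col_means = matrix.mean(axis=0)                  # shape (3,): one mean per column
centred_cols = matrix - col_means                # (2, 3) - (3,) broadcasts row by row

row_means = matrix.mean(axis=1, keepdims=True)   # shape (2, 1), not (2,)
centred_rows = matrix - row_means                # (2, 3) - (2, 1) broadcasts across columns

print(centred_cols)
print(centred_rows)
```

Without keepdims=True the row means would have shape (2,), which cannot be broadcast against a (2, 3) matrix, so NumPy would raise an error rather than guess.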
Think of axis as “the direction that collapses”. axis=0 collapses the rows, leaving one value per column; axis=1 collapses the columns, leaving one value per row. This trips up nearly every beginner, so say it out loud each time until it sticks.
Boolean masking: filter without a loop
data = np.array([12, 45, 7, 88, 23, 64])
mask = data > 30 # a True/False array
print(mask)
print(data[mask]) # keep only matching values
print('How many > 30:', mask.sum())

Output:
[False  True False  True False  True]
[45 88 64]
How many > 30: 3
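Masks can also be combined and used to replace values. The short sketch below goes one step beyond the example above and uses the same data array; the thresholds are arbitrary choices for illustration.

```python
import numpy as np

data = np.array([12, 45, 7, 88, 23, 64])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
mid_range = (data > 20) & (data < 70)
print(data[mid_range])          # values strictly between 20 and 70

# np.where picks from two alternatives element by element
capped = np.where(data > 50, 50, data)
print(capped)                   # values above 50 replaced by 50
```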
This is the same pattern as df[df['amount'] > 30] in pandas (Module 2): you are learning the engine before the steering wheel.
- NumPy's ndarray lets you compute on entire arrays at once: vectorisation, no loops.
- axis=0 collapses rows (per-column result); axis=1 collapses columns (per-row result).
- Broadcasting stretches smaller arrays to fit; Boolean masks filter data the way pandas later will.
3. Linear algebra you actually need
Do not panic — you need surprisingly little, but you need it deeply. A model stores what it has learned as a vector of weights, and it makes a prediction with a dot product. Master those two ideas and most of machine learning stops being magic.
Vectors, weights & the dot product
A row of data is a feature vector. The model holds a matching weight for each feature. A prediction is the weighted sum of the two — exactly the dot product.
import numpy as np
# Features for one house: [bedrooms, area_sqft, age_years]
x = np.array([3, 1200, 10])
# Weights a model has learned (price impact of each feature)
w = np.array([500000, 3000, -20000])
price = np.dot(w, x) # weighted sum == w @ x
print('Predicted price:', price)

Output:
Predicted price: 4900000
That is 500000×3 + 3000×1200 + (-20000)×10. Every linear model, every layer of a neural network, is built from this one operation.
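To make the weighted sum concrete, this sketch computes the same prediction three ways: an explicit loop, an element-wise product followed by a sum, and the @ operator. All three should print the same number.

```python
import numpy as np

x = np.array([3, 1200, 10])
w = np.array([500000, 3000, -20000])

# The dot product spelled out: multiply matching entries, then add them up
total = 0
for weight, feature in zip(w, x):
    total += weight * feature

print(total)                    # explicit loop
print((w * x).sum())            # element-wise product, then sum
print(w @ x)                    # the @ operator
```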
Matrices: predict every row at once
Stack many feature vectors into a matrix X and a single matrix multiply (@) predicts the whole dataset in one shot.
# Three houses, three features each
X = np.array([[3, 1200, 10],
[2, 800, 5],
[4, 2000, 20]])
predictions = X @ w # matrix-vector product
print(predictions)

Output:
[4900000 3300000 7600000]
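The product X @ w only works when the number of columns in X equals the length of w. The sketch below prints the shapes and then deliberately triggers a mismatch; w_short is a hypothetical weight vector with one entry missing, introduced only to show the error.

```python
import numpy as np

X = np.array([[3, 1200, 10],
              [2,  800,  5],
              [4, 2000, 20]])
w = np.array([500000, 3000, -20000])
w_short = np.array([500000, 3000])      # hypothetical: one weight missing

print(X.shape, w.shape)                 # columns of X match the length of w
print((X @ w).shape)                    # one prediction per row

try:
    X @ w_short                         # 3 columns against 2 weights
except ValueError as err:
    print('Shape mismatch:', err)
```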
For X @ w, the number of columns in X must equal the length of w. A shape mismatch is one of the most common errors in machine learning, so when in doubt, print .shape.
- A prediction is a dot product: the weighted sum of a feature vector and a weight vector.
- Matrix multiplication (X @ w) predicts an entire dataset in a single operation.
- Shapes must be compatible (columns of X = length of w). Print .shape to debug.
4. Calculus & gradient descent — how models learn
You will not solve integrals by hand. But you do need one idea: a derivative is a slope — it tells you which way is “downhill”. Training a model means repeatedly stepping downhill on an error surface until the error is as small as possible. That algorithm is gradient descent, and it powers almost everything in modern machine learning.
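One way to build intuition for "a derivative is a slope" is to estimate the slope numerically by nudging x slightly in each direction. The sketch below does this for an illustrative bowl-shaped function; numerical_slope is a hypothetical helper, and the step size h is an assumed value that works well for smooth functions like this one.

```python
def f(x):
    return (x - 3) ** 2 + 2      # illustrative bowl-shaped function


def numerical_slope(func, x, h=1e-5):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in [0.0, 3.0, 5.0]:
    print(f'x = {x}: slope = {numerical_slope(f, x):.4f}')
```

The slope is negative to the left of x = 3 (downhill is to the right), close to zero at the bottom of the bowl, and positive to the right of it (downhill is to the left).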
The recipe
- Define a loss (how wrong the model is).
- Compute its gradient (the slope — which way increases the loss).
- Step the opposite way by a small learning rate.
- Repeat until the loss stops shrinking.
Watch it minimise a simple bowl-shaped function, f(x) = (x - 3)² + 2, whose lowest point is at x = 3:
def f(x): return (x - 3)**2 + 2 # the loss (a bowl)
def grad(x): return 2 * (x - 3) # its slope (derivative)
x = 0.0 # start far from the minimum
lr = 0.1 # learning rate (step size)
for step in range(5):
    x = x - lr * grad(x) # step downhill
    print(f'step {step+1}: x = {x:.4f}, f(x) = {f(x):.4f}')

Output:
step 1: x = 0.6000, f(x) = 7.7600
step 2: x = 1.0800, f(x) = 5.6864
step 3: x = 1.4640, f(x) = 4.3593
step 4: x = 1.7712, f(x) = 3.5099
step 5: x = 2.0170, f(x) = 2.9664
x marches steadily toward 3 and the loss falls toward its minimum of 2. That is learning, stripped to its essence.
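The learning rate decides whether this march succeeds. The sketch below reruns the same descent with three illustrative learning rates; descend is a hypothetical helper, and the specific rates are assumptions chosen to show the slow, well-behaved and divergent cases.

```python
def grad(x):
    return 2 * (x - 3)           # derivative of (x - 3)**2 + 2


def descend(lr, steps=20, start=0.0):
    """Run gradient descent and return the final x (hypothetical helper)."""
    x = start
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# Illustrative learning rates: too small, reasonable, too large
for lr in [0.001, 0.1, 1.1]:
    print(f'lr = {lr}: x after 20 steps = {descend(lr):.4f}')
```

For this particular function each update multiplies the distance from 3 by (1 - 2·lr). With lr = 0.001 that factor is 0.998, so x barely moves; with lr = 0.1 it is 0.8, so x closes in on 3; with lr = 1.1 it is -1.2, so every step overshoots further than the last.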
- A derivative is a slope; the gradient points uphill, so we step the opposite way to reduce error.
- Gradient descent = loss → gradient → step downhill by the learning rate → repeat.
- The learning rate controls step size: too small is slow, too large diverges.
5. Probability & statistics foundations
Data is noisy, and data science is the craft of reasoning under that noise. You need a feel for distributions (how values spread), summary statistics (mean, variance, standard deviation) and sampling (simulating randomness reproducibly).
Simulate and summarise
NumPy's modern random generator lets us simulate data — and a fixed seed makes it reproducible.
import numpy as np
rng = np.random.default_rng(42) # reproducible randomness
# Simulate 10,000 days of website visitors (normally distributed)
visitors = rng.normal(loc=500, scale=80, size=10000)
print('Mean :', round(visitors.mean(), 1))
print('Std :', round(visitors.std(), 1))
print('Median :', round(np.median(visitors), 1))
print('P(>600):', round((visitors > 600).mean(), 3))

Output:
Mean : 500.3
Std : 79.5
Median : 500.1
P(>600): 0.106
We asked “what fraction of days exceed 600 visitors?” and answered it from the data: about 10.6%. Notice the trick — (visitors > 600) is a Boolean array, and its .mean() is the proportion of True values.
The normal distribution & the 68–95–99.7 rule
The bell-shaped normal distribution appears everywhere. For it, about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
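The rule can be checked empirically with the same masking trick. This sketch simulates normally distributed values (the parameters and sample size are illustrative) and measures the fraction that falls within one, two and three standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=500, scale=80, size=100_000)

mean, std = samples.mean(), samples.std()
for k in (1, 2, 3):
    within = np.abs(samples - mean) <= k * std   # Boolean mask
    print(f'within {k} std: {within.mean():.3f}')
```

The printed fractions should land close to 0.68, 0.95 and 0.997, with small differences because this is a finite random sample.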
- Set a seed (default_rng(42)) so simulations and experiments are reproducible.
- (array > value).mean() gives the proportion satisfying a condition, which serves as a probability estimate.
- The normal distribution follows the 68–95–99.7 rule; prefer the median over the mean for skewed data.
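To illustrate why the median is preferred for skewed data, the sketch below simulates right-skewed values. The exponential distribution and the variable name order_values are assumptions made for the example, standing in for any quantity with a few very large values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed data: most values small, a few very large
order_values = rng.exponential(scale=100, size=10_000)

print('Mean  :', round(order_values.mean(), 1))
print('Median:', round(np.median(order_values), 1))
```

The handful of very large values pulls the mean well above the median, so the median is the better description of a typical value here.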
6. From mathematics to a model — a vectorised mini-project
Time to connect every thread of this module. We will fit a straight line to data using least squares — the very first machine-learning model — using nothing but NumPy. Vectors, the dot product, minimising error: it is all here.
Fit a line with NumPy
import numpy as np
# Hours studied vs exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 56, 64, 67, 73, 78, 82, 89])
# Fit score = m*hours + c by least squares (degree-1 polynomial)
m, c = np.polyfit(hours, scores, 1)
print(f'slope (m) = {m:.2f}')
print(f'intercept (c) = {c:.2f}')
# Use the fitted line to predict a new value
pred = m * 9 + c
print(f'Predicted score for 9 hours: {pred:.1f}')

Output:
slope (m) = 5.20
intercept (c) = 46.71
Predicted score for 9 hours: 93.5
The model learned that each extra hour of study is worth about 5.2 marks, starting from a baseline of ~47. Behind polyfit, NumPy solved a linear-algebra system that minimises squared error — gradient descent's closed-form cousin.
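One way to set up that least-squares problem explicitly is with a design matrix and np.linalg.lstsq. This is a sketch of the same fit written as linear algebra, not a claim about how polyfit is implemented internally; the matrix name A is an illustrative choice.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 56, 64, 67, 73, 78, 82, 89])

# Design matrix: one column for the slope, one column of ones for the intercept
A = np.column_stack([hours, np.ones(len(hours))])

# Least-squares solution of A @ [m, c] ≈ scores
(m, c), *_ = np.linalg.lstsq(A, scores, rcond=None)
print(f'slope (m)     = {m:.2f}')
print(f'intercept (c) = {c:.2f}')
```

Each row of A @ [m, c] is m*hours + c for one student, so this is the matrix-vector prediction from Topic 3 with the weights chosen to minimise squared error.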
Measure how good the fit is
# Predictions on the training data, then the error
fit = m * hours + c
error = scores - fit
mse = (error ** 2).mean() # mean squared error
rmse = np.sqrt(mse)
print(f'RMSE = {rmse:.2f} marks')
# R-squared: fraction of variance explained
ss_res = (error ** 2).sum()
ss_tot = ((scores - scores.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(f'R-squared = {r2:.3f}')

Output:
RMSE = 0.88 marks
R-squared = 0.995
An R² of 0.995 means the line explains 99.5% of the variation — a near-perfect fit for this clean, made-up data. With real, noisy data you will rarely see numbers this tidy, and that honesty is the whole job.
In Module 5 you will do this with scikit-learn's LinearRegression, but you now understand exactly what it does under the hood, because you built it from vectors and error yourself.
- np.polyfit(x, y, 1) fits a least-squares line: the simplest ML model, built on linear algebra.
- Evaluate fit with RMSE (typical error size) and R² (fraction of variance explained).
- scikit-learn's LinearRegression (Module 5) automates exactly this, and you now know the internals.
★ Hands-on Project — A NumPy-only Data Toolkit
Cement the module by building a tiny analysis toolkit with no library beyond NumPy. You will simulate data, summarise it, fit a model and evaluate it — the whole lifecycle in miniature.
- Start a fresh notebook in a project folder with its own conda/venv environment and a requirements.txt.
- Use np.random.default_rng(7) to simulate 1,000 students' hours_studied (normal, mean 5, std 1.5) and scores = 50 + 6*hours + noise, where noise is normal with std 4.
- Write a function describe(arr) that returns mean, median, std, min and max using NumPy, with no statistics module.
- Use Boolean masking to report what fraction of students scored above 80, and the mean hours of just those students.
- Fit a line with np.polyfit and print the slope, intercept, RMSE and R² using the formulas from Topic 6.
- Implement gradient descent by hand to find the slope/intercept too, and confirm it converges close to polyfit's answer.
- Write a one-paragraph markdown cell interpreting your results in plain English (what does the slope mean for a student?).
- Restart the kernel, Run All to prove reproducibility, then commit the notebook and requirements.txt to a new GitHub repo.
Ready to test yourself?
Take the module quiz. Score 70% or more to mark this module complete.