🧮 Module 1

Scientific Python & Mathematical Foundations

⏱ 14 hours · Beginner · 6 topics

🎯 By the end: set up a reproducible data-science environment, compute at scale with NumPy arrays and broadcasting, and use the core linear algebra, calculus and probability that every machine-learning model is built on.

Data science sits on three legs: programming, mathematics and domain judgement. Most courses rush to fancy models and leave the maths as a mystery, so students can call model.fit() but cannot say why it works or fix it when it breaks. This module deliberately builds the foundation first. You will set up a professional, reproducible workspace, learn NumPy, the array library that much of the scientific Python stack is built on, and develop a working, code-first intuition for the linear algebra, calculus and probability that power the models you will train later. The approach is light on proofs and heavy on running code.

1. The data-science lifecycle & a reproducible environment

A data scientist does not just train models — they run a project from a fuzzy business question all the way to a deployed, monitored system. The industry-standard map for this is CRISP-DM, and every module of this course slots into one of its phases.

CRISP-DM — the dotted arrow shows it is a cycle: insights from one project feed the next.

Set up your environment (pick one path)

Zero-install (start here): Google Colab — full Python with NumPy, pandas and GPUs in your browser. Visit colab.research.google.com.
On your machine: install Anaconda, which bundles Python 3, Jupyter and the scientific stack, or the lightweight Miniconda, which ships only Python and conda so that you install the packages you need yourself.
Editor: VS Code with the Python and Jupyter extensions for larger projects.

Isolate every project (this matters)

A reproducible project pins its exact libraries so it runs the same on any machine, next year, for anyone. Create a fresh, isolated environment per project:

# with conda
conda create -n ds-course python=3.11 numpy pandas scikit-learn jupyter
conda activate ds-course

# or with plain Python
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install numpy pandas scikit-learn jupyter

# freeze exact versions so others reproduce your results
pip freeze > requirements.txt

Your first scientific-Python cell

import sys
import numpy as np

print('Python :', sys.version.split()[0])
print('NumPy  :', np.__version__)
print('Ready for data science!')

▶ Output

Python : 3.11.7
NumPy  : 1.26.4
Ready for data science!

Reproducibility habit: commit requirements.txt to Git and, before sharing a notebook, restart the kernel and “Run all”. If it runs top-to-bottom with no errors, the notebook does not depend on hidden state or out-of-order cells, which is the first test of reproducibility and a habit that professional teams value highly.
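One way to make the pinned versions visible inside the notebook itself is to print what is actually installed. The sketch below uses only the standard library's importlib.metadata; the helper name snapshot_versions and the package list are illustrative assumptions, not part of the course tooling.

```python
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package list is an assumption: adjust it to match your requirements.txt
versions = snapshot_versions(["numpy", "pandas", "scikit-learn", "jupyter"])
for name, ver in versions.items():
    print(f"{name:<14}{ver or 'not installed'}")
```

Placing a cell like this at the top of a notebook means a reader can compare the printed versions with requirements.txt before running anything else.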

Key points

The CRISP-DM lifecycle (Business → Data → Prepare → Model → Evaluate → Deploy) is a cycle, not a straight line.
Use Colab to start instantly; use conda/venv to isolate each project and pin versions with requirements.txt.
Restart-and-run-all is the test of a reproducible notebook — make it a habit from day one.

2. NumPy: arrays, vectorisation & broadcasting

Much of the data-science stack builds on NumPy: pandas and scikit-learn use it directly, and PyTorch tensors follow the same array model and convert to and from NumPy arrays. Its core object is the ndarray: a grid of numbers that you operate on all at once, with no explicit Python loop. This is called vectorisation, and it is both faster to write and dramatically faster to run.

Lists vs NumPy arrays

	Python list	NumPy array
Maths on the whole thing	needs a loop	one expression
Speed on big data	slow	~10–100× faster
Memory	heavy	compact (fixed type)
Multi-dimensional	awkward	native (matrices, tensors)
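The Memory row of the table can be checked directly. This is a rough sketch: the list figure is an approximation that varies with Python version and platform, and the variable names are illustrative.

```python
import sys
import numpy as np

n = 1000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)   # fixed type: 8 bytes per element

# A list stores pointers to separate Python int objects, so count both
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
array_bytes = as_array.nbytes              # size of the raw data buffer only

print('list  :', list_bytes, 'bytes (approximate)')
print('array :', array_bytes, 'bytes')
```

The array needs exactly 8 × 1000 = 8000 bytes for its data, because every element has the same fixed type; the list pays for a pointer plus a full Python object per element.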

Vectorised arithmetic — no loops

import numpy as np

prices = np.array([250.0, 99.5, 430.0, 75.0, 999.0])
qty    = np.array([4, 10, 2, 8, 1])

revenue = prices * qty            # element-wise, no loop
print('Revenue per line:', revenue)
print('Total revenue   :', revenue.sum())
print('Mean price      :', prices.mean())

▶ Output

Revenue per line: [1000.  995.  860.  600.  999.]
Total revenue   : 4454.0
Mean price      : 370.7

One line, prices * qty, multiplied five pairs of numbers. On a million rows the code looks identical — and runs in milliseconds.
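To see the speed difference on your own machine, the following sketch times a plain loop against the vectorised expression on a million simulated rows. The data is random and the names big_prices and big_qty are assumptions for illustration; actual timings depend on your hardware, so none are quoted here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
big_prices = rng.uniform(50, 1000, size=n)
big_qty = rng.integers(1, 11, size=n)

# Plain-Python loop over the same data
start = time.perf_counter()
total_loop = 0.0
for p, q in zip(big_prices.tolist(), big_qty.tolist()):
    total_loop += p * q
loop_seconds = time.perf_counter() - start

# Vectorised: the same expression as the five-row example
start = time.perf_counter()
total_vec = (big_prices * big_qty).sum()
vec_seconds = time.perf_counter() - start

print(f'loop      : {loop_seconds:.4f} s')
print(f'vectorised: {vec_seconds:.4f} s')
print('same total:', np.isclose(total_loop, total_vec))
```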

Shapes, axes & broadcasting

Arrays have a shape (rows, columns…). Broadcasting lets NumPy stretch a smaller array to fit a bigger one, so you can, say, add a single number to every cell, or subtract a per-column mean.

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)            # (rows, columns)
print(matrix + 10)             # broadcast a scalar to every cell
print(matrix.mean(axis=0))     # mean down each column
print(matrix.sum(axis=1))      # sum across each row

▶ Output

(2, 3)
[[11 12 13]
 [14 15 16]]
[2.5 3.5 4.5]
[ 6 15]

Read axis as “the direction that collapses”. axis=0 collapses the rows, leaving one value per column; axis=1 collapses the columns, leaving one value per row. This trips up nearly every beginner — say it out loud each time until it sticks.
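The per-column mean case mentioned above can be sketched as follows, using the same 2×3 matrix. The keepdims=True argument is the detail that makes the per-row version work.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

col_means = matrix.mean(axis=0)                  # shape (3,): one mean per column
centred = matrix - col_means                     # (2, 3) - (3,) broadcasts across rows

row_means = matrix.mean(axis=1, keepdims=True)   # shape (2, 1), not (2,)
row_centred = matrix - row_means                 # (2, 3) - (2, 1) broadcasts across columns

print(centred)
print(row_centred)
```

Without keepdims=True the row means would have shape (2,), which cannot be broadcast against (2, 3), and NumPy would raise a ValueError.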

Boolean masking — filter without a loop

data = np.array([12, 45, 7, 88, 23, 64])

mask = data > 30               # a True/False array
print(mask)
print(data[mask])              # keep only matching values
print('How many > 30:', mask.sum())

▶ Output

[False  True False  True False  True]
[45 88 64]
How many > 30: 3
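Masks can also be combined. The sketch below reuses the same data array; the 30 and 80 thresholds and the 'high'/'low' labels are arbitrary choices for illustration.

```python
import numpy as np

data = np.array([12, 45, 7, 88, 23, 64])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
mid_range = (data > 30) & (data < 80)
print(data[mid_range])                 # values strictly between 30 and 80

# np.where picks element-wise between two alternatives
labels = np.where(data > 30, 'high', 'low')
print(labels)
```

Python's own and/or keywords do not work element-wise on arrays, which is why NumPy uses the & and | operators here.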

Where this is heading: this exact mask-and-filter idea becomes df[df['amount'] > 30] in pandas (Module 2) — you are learning the engine before the steering wheel.

Key points

NumPy's ndarray lets you compute on entire arrays at once — vectorisation, no loops.
axis=0 collapses rows (per-column result); axis=1 collapses columns (per-row result).
Broadcasting stretches smaller arrays to fit; Boolean masks filter data the way pandas later will.

3. Linear algebra you actually need

Do not panic — you need surprisingly little, but you need it deeply. A model stores what it has learned as a vector of weights, and it makes a prediction with a dot product. Master those two ideas and most of machine learning stops being magic.

Vectors, weights & the dot product

A row of data is a feature vector. The model holds a matching weight for each feature. A prediction is the weighted sum of the two — exactly the dot product.

import numpy as np

# Features for one house: [bedrooms, area_sqft, age_years]
x = np.array([3, 1200, 10])

# Weights a model has learned (price impact of each feature)
w = np.array([500000, 3000, -20000])

price = np.dot(w, x)           # weighted sum  ==  w @ x
print('Predicted price:', price)

▶ Output

Predicted price: 4900000

That is 500000×3 + 3000×1200 + (-20000)×10. Every linear model, every layer of a neural network, is built from this one operation.
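The arithmetic above can be reproduced step by step. This sketch spells out the dot product as "multiply matching entries, then add", using the same x and w.

```python
import numpy as np

x = np.array([3, 1200, 10])              # [bedrooms, area_sqft, age_years]
w = np.array([500000, 3000, -20000])     # one weight per feature

# The dot product written out: multiply matching entries, then add them up
contributions = w * x
by_hand = sum(int(wi) * int(xi) for wi, xi in zip(w, x))

print('Per-feature contribution:', contributions)
print('Sum of contributions    :', contributions.sum())
print('Matches w @ x           :', by_hand == w @ x)
```

Looking at the per-feature contributions is also a quick way to see which feature drives a prediction: here the area term contributes the most.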

Matrices: predict every row at once

Stack many feature vectors into a matrix X and a single matrix multiply (@) predicts the whole dataset in one shot.

# Three houses, three features each
X = np.array([[3, 1200, 10],
              [2,  800,  5],
              [4, 2000, 20]])

predictions = X @ w            # matrix-vector product
print(predictions)

▶ Output

[4900000 3300000 7600000]

One matrix multiply turns a table of features into a column of predictions.

Shapes must line up. To compute X @ w, the number of columns in X must equal the length of w. Shape mismatches are among the most common errors in machine-learning code, so when in doubt, print .shape.
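Here is what a mismatch looks like in practice. The weight vector w_bad is a deliberately broken, hypothetical example with one weight missing.

```python
import numpy as np

X = np.array([[3, 1200, 10],
              [2,  800,  5],
              [4, 2000, 20]])            # shape (3, 3): 3 houses, 3 features
w_bad = np.array([500000, 3000])         # shape (2,): one weight missing

print('X:', X.shape, ' w_bad:', w_bad.shape)

try:
    X @ w_bad                            # columns of X (3) != length of w_bad (2)
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print('ValueError:', err)
```

Printing both shapes before the multiply, as on the first print line, usually makes the problem obvious before the error is even raised.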

Key points

A prediction is a dot product: the weighted sum of a feature vector and a weight vector.
Matrix multiplication (X @ w) predicts an entire dataset in a single operation.
Shapes must be compatible (columns of X = length of w) — print .shape to debug.

4. Calculus & gradient descent: how models learn

You will not solve integrals by hand. But you do need one idea: a derivative is a slope — it tells you which way is “downhill”. Training a model means repeatedly stepping downhill on an error surface until the error is as small as possible. That algorithm is gradient descent, and it powers almost everything in modern machine learning.

The recipe

1. Define a loss (how wrong the model is).
2. Compute its gradient (the slope, which points in the direction that increases the loss).
3. Step the opposite way, scaled by a small learning rate.
4. Repeat until the loss stops shrinking.

Watch it minimise a simple bowl-shaped function, f(x) = (x - 3)² + 2, whose lowest point is at x = 3:

def f(x):     return (x - 3)**2 + 2     # the loss (a bowl)
def grad(x):  return 2 * (x - 3)        # its slope (derivative)

x  = 0.0      # start far from the minimum
lr = 0.1      # learning rate (step size)

for step in range(5):
    x = x - lr * grad(x)                # step downhill
    print(f'step {step+1}: x = {x:.4f}, f(x) = {f(x):.4f}')

▶ Output

step 1: x = 0.6000, f(x) = 7.7600
step 2: x = 1.0800, f(x) = 5.6864
step 3: x = 1.4640, f(x) = 4.3593
step 4: x = 1.7712, f(x) = 3.5099
step 5: x = 2.0170, f(x) = 2.9664

x marches steadily toward 3 and the loss falls toward its minimum of 2. That is learning, stripped to its essence.

Each step moves opposite the slope, sliding down the bowl to the minimum.
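If you are unsure that a hand-written derivative is right, you can compare it with a numerical estimate of the slope. The sketch below uses a central difference; the helper name numeric_slope and the step size h are assumptions chosen for illustration.

```python
def f(x):
    return (x - 3)**2 + 2        # the same bowl-shaped loss


def grad(x):
    return 2 * (x - 3)           # the analytic derivative


def numeric_slope(func, x, h=1e-5):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in [0.0, 3.0, 5.0]:
    print(f'x = {x}: analytic = {grad(x):+.4f}, numeric = {numeric_slope(f, x):+.4f}')
```

The slope is negative to the left of 3, zero at 3 and positive to the right, which is why stepping against it always moves x toward the minimum.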

The learning rate is the dial that matters most. Too small and training crawls; too large and it overshoots and diverges. You will tune this constantly from Module 5 onward.
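The effect of the learning rate is easy to see on the same bowl. In this sketch the three rates and the 20-step budget are illustrative choices, not recommended defaults.

```python
def grad(x):
    return 2 * (x - 3)                   # slope of f(x) = (x - 3)**2 + 2


def descend(lr, steps=20, x=0.0):
    """Run gradient descent from x and return the final position."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# Compare a small, a moderate and an overly large learning rate
final = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, x in final.items():
    print(f'lr = {lr:<4}: x after 20 steps = {x:10.4f}  (distance from 3: {abs(x - 3):.4f})')
```

For this particular function each step multiplies the distance from 3 by (1 - 2·lr). With lr = 0.01 that factor is 0.98 (slow), with lr = 0.1 it is 0.8 (steady), and with lr = 1.1 it is -1.2, so the distance grows on every step and the run diverges.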

Key points

A derivative is a slope; the gradient points uphill, so we step the opposite way to reduce error.
Gradient descent = loss → gradient → step downhill by the learning rate → repeat.
The learning rate controls step size: too small is slow, too large diverges.

5. Probability & statistics foundations

Data is noisy, and data science is the craft of reasoning under that noise. You need a feel for distributions (how values spread), summary statistics (mean, variance, standard deviation) and sampling (simulating randomness reproducibly).

Simulate and summarise

NumPy's modern random generator lets us simulate data — and a fixed seed makes it reproducible.

import numpy as np

rng = np.random.default_rng(42)          # reproducible randomness

# Simulate 10,000 days of website visitors (normally distributed)
visitors = rng.normal(loc=500, scale=80, size=10000)

print('Mean   :', round(visitors.mean(), 1))
print('Std    :', round(visitors.std(), 1))
print('Median :', round(np.median(visitors), 1))
print('P(>600):', round((visitors > 600).mean(), 3))

▶ Output

Mean   : 500.3
Std    : 79.5
Median : 500.1
P(>600): 0.106

We asked “what fraction of days exceed 600 visitors?” and answered it from the data: about 10.6%. Notice the trick — (visitors > 600) is a Boolean array, and its .mean() is the proportion of True values.
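Because the simulated data is normal with a known mean and standard deviation, the estimate can be compared with the exact tail probability. This sketch uses math.erfc from the standard library; the comparison itself is an addition to the example above, not part of it.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
visitors = rng.normal(loc=500, scale=80, size=10000)

simulated = (visitors > 600).mean()

# Exact tail probability for a normal distribution: P(X > t) = 0.5 * erfc(z / sqrt(2))
z = (600 - 500) / 80                     # 600 is 1.25 standard deviations above the mean
theoretical = 0.5 * math.erfc(z / math.sqrt(2))

print('Simulated  :', round(simulated, 3))
print('Theoretical:', round(theoretical, 3))
```

The two numbers should agree to within sampling noise; a larger size in rng.normal would bring the simulated value closer to the theoretical one.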

The normal distribution & the 68–95–99.7 rule

The bell-shaped normal distribution appears everywhere. For it, about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.

The 68–95–99.7 rule: most data sits close to the mean in a normal distribution.
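The rule can be checked empirically with the same Boolean-mean trick. This is a sketch with an arbitrary seed and sample size, so the printed proportions will be close to, not exactly, 0.68, 0.95 and 0.997.

```python
import numpy as np

rng = np.random.default_rng(0)                       # seed is arbitrary
sample = rng.normal(loc=0, scale=1, size=100_000)

mean, std = sample.mean(), sample.std()
within = {}
for k in (1, 2, 3):
    inside = np.abs(sample - mean) <= k * std        # Boolean mask
    within[k] = inside.mean()                        # proportion of True values
    print(f'within {k} std: {within[k]:.3f}')
```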

Mean vs median. The mean is the average; the median is the middle value. When data is skewed (a few huge salaries, say), the median is the more honest “typical” value. Always check both.
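A small made-up example shows how far one outlier can pull the mean. The salary figures below are hypothetical, in thousands.

```python
import numpy as np

# Hypothetical salaries in thousands; the last value is a single outlier
salaries = np.array([30, 32, 35, 38, 40, 42, 45, 500])

print('Mean  :', salaries.mean())
print('Median:', np.median(salaries))
```

The mean comes out at 95.25, higher than seven of the eight salaries, while the median of 39.0 sits where most of the data actually is.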

Where this is heading: distributions, sampling and proportions are the raw material of Module 4 (statistics & A/B testing) and of every confidence interval and significance test you will ever report.

Key points

Set a seed (default_rng(42)) so simulations and experiments are reproducible.
(array > value).mean() gives the proportion satisfying a condition — a probability estimate.
The normal distribution follows the 68–95–99.7 rule; for skewed data, prefer the median over the mean.

6. From mathematics to a model: a vectorised mini-project

Time to connect every thread of this module. We will fit a straight line to data using least squares — the very first machine-learning model — using nothing but NumPy. Vectors, the dot product, minimising error: it is all here.

Fit a line with NumPy

import numpy as np

# Hours studied vs exam score
hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 56, 64, 67, 73, 78, 82, 89])

# Fit  score = m*hours + c  by least squares (degree-1 polynomial)
m, c = np.polyfit(hours, scores, 1)
print(f'slope (m)     = {m:.2f}')
print(f'intercept (c) = {c:.2f}')

# Use the fitted line to predict a new value
pred = m * 9 + c
print(f'Predicted score for 9 hours: {pred:.1f}')

▶ Output

slope (m)     = 5.20
intercept (c) = 46.71
Predicted score for 9 hours: 93.5

The model learned that each extra hour of study is worth about 5.2 marks, starting from a baseline of ~47. Behind polyfit, NumPy solved a linear-algebra system that minimises squared error — gradient descent's closed-form cousin.
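To make the "linear-algebra system" concrete, the same line can be recovered with np.linalg.lstsq. This is a sketch of the underlying idea rather than a description of polyfit's exact internals; the names A, m_ls and c_ls are illustrative.

```python
import numpy as np

hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 56, 64, 67, 73, 78, 82, 89])

# Design matrix: one column for hours, one column of ones for the intercept
A = np.column_stack([hours, np.ones_like(hours)])

# Least-squares solution of A @ [m, c] ≈ scores
(m_ls, c_ls), *_ = np.linalg.lstsq(A, scores, rcond=None)

print(f'slope     = {m_ls:.2f}')
print(f'intercept = {c_ls:.2f}')
```

Each row of A @ [m, c] is m*hours + c for one student, so this is the matrix-vector product from Topic 3 with the weights chosen to minimise squared error. The slope and intercept should match the polyfit values of 5.20 and 46.71.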

Measure how good the fit is

# Predictions on the training data, then the error
fit   = m * hours + c
error = scores - fit

mse = (error ** 2).mean()              # mean squared error
rmse = np.sqrt(mse)
print(f'RMSE = {rmse:.2f} marks')

# R-squared: fraction of variance explained
ss_res = (error ** 2).sum()
ss_tot = ((scores - scores.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(f'R-squared = {r2:.3f}')

▶ Output

RMSE = 0.88 marks
R-squared = 0.995

An R² of 0.995 means the line explains 99.5% of the variation — a near-perfect fit for this clean, made-up data. With real, noisy data you will rarely see numbers this tidy, and that honesty is the whole job.

You just did machine learning. In Module 5 you will replace the np.polyfit call with scikit-learn's LinearRegression, but you now understand what it does under the hood, because you computed the predictions, the error and R² from arrays yourself.

Key points

np.polyfit(x, y, 1) fits a least-squares line — the simplest ML model, built on linear algebra.
Evaluate fit with RMSE (typical error size) and R² (fraction of variance explained).
scikit-learn's LinearRegression (Module 5) automates exactly this — you now know the internals.

★ Hands-on Project — A NumPy-only Data Toolkit

Cement the module by building a tiny analysis toolkit with no library beyond NumPy. You will simulate data, summarise it, fit a model and evaluate it — the whole lifecycle in miniature.

Start a fresh notebook in a project folder with its own conda/venv environment and a requirements.txt.
Use np.random.default_rng(7) to simulate 1,000 students' hours_studied (normal, mean 5, std 1.5) and scores = 50 + 6*hours + noise, where noise is normal with std 4.
Write a function describe(arr) that returns mean, median, std, min and max using NumPy — no statistics module.
Use Boolean masking to report what fraction of students scored above 80, and the mean hours of just those students.
Fit a line with np.polyfit and print the slope, intercept, RMSE and R² using the formulas from Topic 6.
Implement gradient descent by hand to find the slope/intercept too, and confirm it converges close to polyfit's answer.
Write a one-paragraph markdown cell interpreting your results in plain English (what does the slope mean for a student?).
Restart the kernel, Run All to prove reproducibility, then commit the notebook and requirements.txt to a new GitHub repo.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
