🤖 Module 5

Machine Learning Foundations

⏱ 18 hours · Intermediate · 6 topics

🎯 By the end: frame a supervised-learning problem, train regression and classification models with scikit-learn's estimator API, evaluate them honestly with the right metrics and cross-validation, build leak-free preprocessing pipelines, and diagnose overfitting.

This is the module the whole course has been building toward. Machine learning is teaching a computer to find patterns from examples instead of hard-coded rules — and scikit-learn is the library that made it accessible to everyone. The magic is a single, consistent design: every model is an estimator with the same fit / predict interface, so once you learn one, you know them all. We will keep the focus where it belongs for a beginner: a clean workflow, honest evaluation, and avoiding the traps that make models look good in a notebook and fail in the real world.

1. What is machine learning? The train/test split

ML comes in two main flavours. In supervised learning you have labelled examples (input → known answer) and learn to predict the answer for new inputs. In unsupervised learning there are no labels — you find structure (clusters, themes) in the data itself. This module is all supervised; Module 6 covers unsupervised.

Type	You have	You predict	Example
Regression	labels that are numbers	a number	house price, demand
Classification	labels that are categories	a category	spam / not spam, churn
Clustering	no labels	groups	customer segments

The golden rule: never test on training data

A model that has seen the answers can memorise them. To measure real performance you must hold out data it never saw during training.

Train on most of the data; keep a held-out test set to measure honest performance.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print('Train rows:', len(X_train), ' Test rows:', len(X_test))

▶ Output

Train rows: 160  Test rows: 40
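
To make the golden rule concrete, here is a minimal sketch on synthetic data (the dataset from make_classification and the choice of an unpruned decision tree are assumptions for illustration, not part of the course data). A fully grown tree can memorise its training rows, so its training accuracy says little about how it handles unseen rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features, binary label.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out 20% of the rows; the model never sees them during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unpruned tree keeps splitting until it has memorised the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # scored on rows it has already seen
test_acc = tree.score(X_test, y_test)      # scored on held-out rows

print('Train rows:', len(X_train), ' Test rows:', len(X_test))
print('Train accuracy:', round(train_acc, 3))
print('Test accuracy :', round(test_acc, 3))
```

Only the test accuracy is an honest estimate; the training accuracy mostly measures memorisation.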

Key points

Supervised learning predicts a known label; regression predicts numbers, classification predicts categories.
Unsupervised learning finds structure (clusters) without labels.
Always hold out a test set — never evaluate a model on data it trained on.

2. Your first model: linear regression

Linear regression predicts a number as a weighted sum of features — exactly the dot product from Module 1, now fitted automatically. It is the “hello world” of ML and shows the scikit-learn pattern you will reuse for every model.

The estimator API: fit → predict → score

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# TV ad spend ($k) -> product sales ($k)
X = np.array([[230],[44],[17],[151],[180],[8],[57],[120],[199],[66]])
y = np.array([22, 10, 9, 18, 19, 6, 11, 16, 21, 12])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_tr, y_tr)              # 1. learn
preds = model.predict(X_te)       # 2. predict

print('Slope (per $1k TV):', round(model.coef_[0], 4))
print('Intercept        :', round(model.intercept_, 2))
print('Test R2          :', round(model.score(X_te, y_te), 3))

▶ Output

Slope (per $1k TV): 0.0731
Intercept        : 6.21
Test R2          : 0.901

The model learned that each extra $1,000 of TV spend lifts sales by about 0.073 in the same $k units, i.e. roughly $73, and it explains about 90% of the variance in the held-out test set. Every scikit-learn predictor, from a random forest to a neural net, uses this same fit/predict/score trio.

Exact numbers depend on the split. Because train_test_split shuffles, a different random_state gives slightly different coefficients. Setting random_state=42 makes the run reproducible — always pin it in examples and experiments.
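
The "weighted sum" description can be checked by hand. The sketch below refits the same ad-spend data on all ten rows (a simplification, so the coefficients will differ slightly from the split-based run above) and rebuilds the predictions from coef_ and intercept_. The two new_spend values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same TV ad spend ($k) -> sales ($k) data as in the example above.
X = np.array([[230], [44], [17], [151], [180], [8], [57], [120], [199], [66]])
y = np.array([22, 10, 9, 18, 19, 6, 11, 16, 21, 12])

model = LinearRegression().fit(X, y)

# Hypothetical new budgets to predict for (illustrative values).
new_spend = np.array([[100], [250]])

# Manual prediction: dot product of features and weights, plus the intercept.
manual = new_spend @ model.coef_ + model.intercept_
library = model.predict(new_spend)

print('Manual :', manual.round(3))
print('predict:', library.round(3))
```

Both lines print the same numbers, which is all that predict does for a linear model.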

Key points

Every scikit-learn model follows the same API: fit(X, y), then predict(X), then score.
Linear regression fits a weighted sum of features; coef_ and intercept_ hold what it learned.
Pin random_state for reproducible splits and results.

3. Classification & the confusion matrix

When the label is a category, you classify. Logistic regression — despite its name — is the standard first classifier: it outputs a probability between 0 and 1, then thresholds it into a class.

Train a classifier

from sklearn.linear_model import LogisticRegression

# hours studied -> passed (1) or failed (0)
hours  = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

clf = LogisticRegression().fit(hours, passed)

print('P(pass | 4.5 hrs):', round(clf.predict_proba([[4.5]])[0, 1], 3))
print('Predict for 7 hrs :', clf.predict([[7]])[0])

▶ Output

P(pass | 4.5 hrs): 0.503
Predict for 7 hrs : 1
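
The "probability, then threshold" step can be reproduced explicitly. This is a sketch on the same hours/passed data; the grid of study hours and the stricter 0.8 cut-off are illustrative assumptions. It checks that predict_proba is the sigmoid of the linear score and that predict matches a 0.5 threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Illustrative inputs (assumption): a few study durations to inspect.
grid = np.array([[2.0], [4.0], [5.0], [8.0]])

proba = clf.predict_proba(grid)[:, 1]      # P(pass) for each row
z = clf.decision_function(grid)            # linear score: w * hours + b
sigmoid = 1.0 / (1.0 + np.exp(-z))         # squashes the score into (0, 1)

default_labels = (proba > 0.5).astype(int) # what predict() does for two classes
strict_labels = (proba > 0.8).astype(int)  # a stricter, hand-picked threshold

print('P(pass)       :', proba.round(3))
print('threshold 0.5 :', default_labels)
print('threshold 0.8 :', strict_labels)
```

Raising the threshold trades recall for precision, which is why the choice of cut-off depends on the cost of each kind of error.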

Read the confusion matrix

Accuracy alone hides where a classifier fails. The confusion matrix shows the four outcomes: true/false × positive/negative.

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))

▶ Output

[[3 1]
 [1 5]]

The diagonal cells (3 and 5) are correct predictions; the off-diagonal cells (1 and 1) are errors. Precision and recall come straight from these four cells.

Accuracy lies on imbalanced data. If 99% of emails are not spam, a model that says “never spam” is 99% accurate and useless. Watch precision (of those flagged, how many were right) and recall (of the real positives, how many we caught).
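
Both claims can be verified in a few lines. The sketch first derives precision and recall from the four cells of the matrix above, then builds the "never spam" scenario with made-up labels (99 non-spam, 1 spam) to show high accuracy alongside zero recall.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of those flagged positive, how many were right
recall = tp / (tp + fn)      # of the real positives, how many were caught
print('precision:', round(precision, 3), ' recall:', round(recall, 3))

# Illustrative imbalanced labels (assumption): 99 legitimate emails, 1 spam.
spam_true = [0] * 99 + [1]
never_spam = [0] * 100       # a "model" that never flags anything

acc = accuracy_score(spam_true, never_spam)
rec = recall_score(spam_true, never_spam)
print('accuracy:', acc, ' recall:', rec)
```

The second pair of numbers shows the trap: 0.99 accuracy, yet not a single spam email caught.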

Key points

Logistic regression outputs a probability via predict_proba, then thresholds to a class.
The confusion matrix breaks results into TN, FP, FN, TP — far more informative than accuracy.
On imbalanced data, judge with precision and recall, not accuracy.

4. Evaluating models honestly: metrics & cross-validation

Pick the metric that matches the cost of being wrong, and validate on more than one lucky split.

Classification metrics in one report

from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, digits=2))

▶ Output

              precision    recall  f1-score   support
           0       0.75      0.75      0.75         4
           1       0.83      0.83      0.83         6
    accuracy                           0.80        10
   macro avg       0.79      0.79      0.79        10
weighted avg       0.80      0.80      0.80        10

Metric	Question it answers	Use when
Precision	Of flagged positives, how many were right?	false alarms are costly
Recall	Of real positives, how many did we catch?	misses are costly (disease, fraud)
F1	Balance of precision & recall	you need both
RMSE / R²	Regression error / variance explained	predicting numbers
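
The regression row of the table has no example yet, so here is a small sketch with three made-up values. It computes RMSE as the square root of the mean squared error, alongside R² from r2_score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values (assumption): three true targets and three predictions.
y_true_reg = np.array([3.0, 5.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 9.0])

# RMSE is in the same units as the target, which makes it easy to interpret.
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# R2 compares the model's squared error with that of always predicting the mean.
r2 = r2_score(y_true_reg, y_pred_reg)

print('RMSE:', round(rmse, 3))
print('R2  :', round(r2, 3))
```

Here the errors are -1, 0 and 2, so the mean squared error is 5/3 and the RMSE is about 1.291; R² is 1 - 5/8 = 0.375.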

Cross-validation: don't trust one split

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print('Fold scores:', scores.round(2))
print(f'Mean R2: {scores.mean():.3f} (+/- {scores.std():.3f})')

▶ Output

Fold scores: [0.88 0.91 0.85 0.93 0.89]
Mean R2: 0.892 (+/- 0.028)

Each round holds out a different fold for validation; averaging the scores gives a stable estimate.

Cross-validation beats a single split. One test set might be lucky or unlucky. k-fold rotates the held-out portion, so the mean ± spread tells you both performance and how much it varies.
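
To see what the rotation looks like, the sketch below writes the k-fold loop by hand and compares it with cross_val_score. The data comes from make_regression, which is an assumption for illustration; with the same KFold splitter both approaches should give identical fold scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 100 rows, 3 features, noisy target.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

kf = KFold(n_splits=5)   # 5 consecutive folds, no shuffling

manual_scores = []
for train_idx, val_idx in kf.split(X):
    # Fit a fresh model on 4 folds, score it on the 1 fold left out.
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The one-liner performs the same rotation internally.
auto_scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='r2')

print('Manual:', manual_scores.round(3))
print('Auto  :', auto_scores.round(3))
```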

Key points

Choose metrics by the cost of errors: precision (false alarms), recall (misses), F1 (both), RMSE/R² (regression).
classification_report gives precision, recall and F1 per class at a glance.
k-fold cross-validation rotates the validation fold for a stable mean ± spread estimate.

5. Pipelines & preprocessing (no data leakage)

Most models need scaled numbers and encoded categories. The danger is data leakage — letting test information sneak into training (e.g. scaling using the test set's mean). The Pipeline bundles preprocessing with the model so the same fitted transforms apply correctly within each fold.

A clean pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),       # fit on train fold only
    ('model', LogisticRegression()),
])

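# X_tr, X_te, y_tr, y_te: a train/test split of a classification dataset
# (not the regression split from topic 2)
pipe.fit(X_tr, y_tr)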
print('Test accuracy:', round(pipe.score(X_te, y_te), 3))

▶ Output

Test accuracy: 0.95
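
A self-contained sketch of the same pipeline, using synthetic data from make_classification (an assumption, since the snippet above does not show its dataset). It checks that the scaler inside the pipeline learned its mean from the training rows only, then cross-validates the whole pipeline so scaling is refit inside each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 6 features, binary label.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The fitted scaler's mean comes from the training rows only.
scaler_mean = pipe.named_steps['scale'].mean_
print('Scaler mean == train mean:', np.allclose(scaler_mean, X_tr.mean(axis=0)))

# Passing the whole pipeline to cross_val_score refits the scaler per fold.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
print('CV accuracy:', cv_scores.round(3))
print('Test accuracy:', round(pipe.score(X_te, y_te), 3))
```

The leaky alternative would be to call StandardScaler().fit_transform(X) on all rows before splitting, which lets test-set statistics shape the training features.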

Mixed data types with ColumnTransformer

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

pre = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

full = Pipeline([('prep', pre), ('model', LogisticRegression())])
# full.fit(X_train, y_train)  ->  one object does everything, leak-free
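
To run the ColumnTransformer end to end you need a table with those three columns. The sketch below invents a tiny DataFrame (all values, the 'bought' target and the unseen city 'Pune' are hypothetical) and shows that handle_unknown='ignore' lets the pipeline score a category it never saw during training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: income in $k, 'bought' is the label to predict.
df = pd.DataFrame({
    'age':    [22, 35, 47, 51, 29, 63, 41, 38],
    'income': [28, 52, 75, 81, 40, 66, 58, 49],
    'city':   ['Delhi', 'Mumbai', 'Delhi', 'Chennai',
               'Mumbai', 'Chennai', 'Delhi', 'Mumbai'],
    'bought': [0, 0, 1, 1, 0, 1, 1, 0],
})
X_train = df[['age', 'income', 'city']]
y_train = df['bought']

pre = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
full = Pipeline([('prep', pre), ('model', LogisticRegression())])
full.fit(X_train, y_train)

# A new row with a city that was not in the training data.
new_customer = pd.DataFrame({'age': [33], 'income': [45], 'city': ['Pune']})

# 2 scaled numeric columns + 3 one-hot city columns (all zero for 'Pune').
encoded = full.named_steps['prep'].transform(new_customer)
prediction = full.predict(new_customer)

print('Encoded shape:', encoded.shape)
print('Prediction   :', prediction[0])
```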

Pipelines are professional practice, not a nicety. When you pass the whole pipeline to cross_val_score or a grid search, every transform is fit only on the training portion of each fold. That removes the silent leakage that makes amateur models look great in the notebook and fail in production.

Key points

Data leakage = test information influencing training; it inflates scores and breaks in production.
Pipeline chains preprocessing + model so transforms fit only on training data per fold.
ColumnTransformer applies scaling to numeric and one-hot encoding to categorical columns together.

6. Overfitting, underfitting & the bias–variance trade-off

The central tension in ML. An underfit model is too simple to capture the pattern (high bias). An overfit model memorises noise and fails on new data (high variance). The sweet spot generalises.

Underfit misses the trend; overfit chases every point; the good fit captures the signal, not the noise.

Spot it: train vs test gap

print('Train R2:', round(model.score(X_tr, y_tr), 3))
print('Test  R2:', round(model.score(X_te, y_te), 3))

▶ Output

Train R2: 0.918
Test  R2: 0.901

A small gap (here ~0.02) means the model generalises. A large gap — superb on train, poor on test — is the signature of overfitting.
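
One way to watch the gap open up is to fit polynomials of increasing degree to noisy data. This is a sketch under stated assumptions: the sine-shaped signal, the noise level and the degrees 1, 4 and 12 are all chosen for illustration. Typically the straight line underfits, while the high-degree fit scores best on train and shows the widest train/test gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a curved signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for degree in (1, 4, 12):
    # Higher degree = more flexible model = more room to chase noise.
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X_tr, y_tr)
    train_r2 = poly_model.score(X_tr, y_tr)
    test_r2 = poly_model.score(X_te, y_te)
    results[degree] = (train_r2, test_r2)
    print(f'degree {degree:>2}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}')
```

Train R² can only go up as the degree grows, so it is the test column that tells you when added flexibility has stopped helping.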

Regularisation: penalise complexity

from sklearn.linear_model import Ridge

# alpha controls the penalty: higher = simpler, less overfitting
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
print('Ridge test R2:', round(ridge.score(X_te, y_te), 3))

▶ Output

Ridge test R2: 0.904
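
To see what the penalty does, the sketch below fits Ridge with three alpha values on synthetic data that has many features relative to rows (the dataset and the alpha grid are assumptions for illustration). A larger alpha shrinks the coefficient vector toward zero, which is what "simpler" means here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (assumption): 40 rows, 20 features, noisy target.
X, y = make_regression(n_samples=40, n_features=20, noise=15.0, random_state=0)

alphas = (0.01, 1.0, 100.0)
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the weights measures how "large" the fitted model is.
    norms.append(np.linalg.norm(ridge.coef_))
    print(f'alpha = {alpha:>6}: coefficient norm = {norms[-1]:.2f}')
```

In practice alpha is chosen by cross-validation rather than by eye. Scaling the features first matters too, because the penalty treats all coefficients alike.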

The cures for overfitting: more data, fewer/simpler features, regularisation (Ridge/Lasso), and honest cross-validation. The cure for underfitting is the opposite — a more flexible model or better features. Diagnosing which one you have is half the job.

Key points

Underfitting = too simple (high bias); overfitting = memorising noise (high variance).
A large train-vs-test performance gap is the hallmark of overfitting.
Combat overfitting with more data, simpler models, and regularisation (Ridge/Lasso).

★ Hands-on Project — End-to-End Prediction Model

Build a complete, honestly-evaluated supervised model on a real dataset and document the workflow like a professional.

Choose a dataset with a clear target — regression (e.g. California housing) or classification (e.g. Titanic survival, telco churn).
Split into train/test with a fixed random_state; never touch the test set until the very end.
Build a Pipeline with a ColumnTransformer that scales numeric columns and one-hot encodes categoricals.
Train at least two models (e.g. linear/logistic plus a tree-based model) and compare them with 5-fold cross-validation.
Report the right metrics: RMSE and R² for regression, or precision/recall/F1 and a confusion matrix for classification.
Diagnose fit: compare train vs cross-val scores and state whether the model under- or over-fits.
Apply regularisation or feature changes to improve generalisation, then evaluate ONCE on the held-out test set.
Write a short model card: data, features, metric, performance, limitations — and commit the notebook to your portfolio.
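
As a starting point, the steps above might be sketched like this for a classification task. It is a skeleton under stated assumptions: synthetic data from make_classification stands in for a real dataset, the two candidate models and the F1 scoring choice are illustrative, and a real project would add a ColumnTransformer for categorical columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption); replace with your chosen dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split once with a fixed seed; the test set stays untouched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

candidates = {
    'logistic': Pipeline([('scale', StandardScaler()),
                          ('model', LogisticRegression())]),
    'forest': Pipeline([('model', RandomForestClassifier(
        n_estimators=100, random_state=0))]),
}

# Compare candidates with 5-fold cross-validation on the training set only.
cv_means = {}
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
    cv_means[name] = scores.mean()
    print(f'{name:>8}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})')

# Refit the winner on all training rows, then evaluate ONCE on the test set.
best_name = max(cv_means, key=cv_means.get)
best = candidates[best_name].fit(X_train, y_train)
test_acc = best.score(X_test, y_test)
print('Selected:', best_name)
print(classification_report(y_test, best.predict(X_test), digits=2))
```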

