This is the module the whole course has been building toward. Machine learning is teaching a computer to find patterns from examples instead of hard-coded rules — and scikit-learn is the library that made it accessible to everyone. The magic is a single, consistent design: every model is an estimator with the same fit / predict interface, so once you learn one, you know them all. We will keep the focus where it belongs for a beginner: a clean workflow, honest evaluation, and avoiding the traps that make models look good in a notebook and fail in the real world.
1. What is machine learning? The train/test split
ML comes in two main flavours. In supervised learning you have labelled examples (input → known answer) and learn to predict the answer for new inputs. In unsupervised learning there are no labels — you find structure (clusters, themes) in the data itself. This module is all supervised; Module 6 covers unsupervised.
| Type | You have | You predict | Example |
|---|---|---|---|
| Regression | labels that are numbers | a number | house price, demand |
| Classification | labels that are categories | a category | spam / not spam, churn |
| Clustering | no labels | groups | customer segments |
The golden rule: never test on training data
A model that has seen the answers can memorise them. To measure real performance you must hold out data it never saw during training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
print('Train rows:', len(X_train), ' Test rows:', len(X_test))
Train rows: 160  Test rows: 40
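To see why the golden rule matters, here is a minimal sketch under stated assumptions: the data is hypothetical (random features with coin-flip labels, so there is nothing real to learn) and the decision tree is just a convenient model that can memorise its training rows. Scored on the training data it looks perfect; scored on the held-out rows it should land near chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 5 random features and labels that are pure coin flips,
# so there is no genuine pattern to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree keeps splitting until it has memorised every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on rows it has already seen
test_acc = tree.score(X_test, y_test)     # scored on held-out rows

print('Train accuracy:', train_acc)
print('Test accuracy :', test_acc)
```

The training score alone would suggest a perfect model; only the held-out score reveals that nothing useful was learned.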
- Supervised learning predicts a known label; regression predicts numbers, classification predicts categories.
- Unsupervised learning finds structure (clusters) without labels.
- Always hold out a test set — never evaluate a model on data it trained on.
2. Your first model: linear regression
Linear regression predicts a number as a weighted sum of features — exactly the dot product from Module 1, now fitted automatically. It is the “hello world” of ML and shows the scikit-learn pattern you will reuse for every model.
The estimator API: fit → predict → score
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# TV ad spend ($k) -> product sales ($k)
X = np.array([[230],[44],[17],[151],[180],[8],[57],[120],[199],[66]])
y = np.array([22, 10, 9, 18, 19, 6, 11, 16, 21, 12])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_tr, y_tr) # 1. learn
preds = model.predict(X_te) # 2. predict
print('Slope (per $1k TV):', round(model.coef_[0], 4))
print('Intercept :', round(model.intercept_, 2))
print('Test R2 :', round(model.score(X_te, y_te), 3))
Slope (per $1k TV): 0.0731
Intercept : 6.21
Test R2 : 0.901
The model learned that each extra $1,000 of TV spend lifts sales by about 0.073 in the target's units. Sales are measured in $k, so that is roughly $73. It also explains about 90% of the variation in the held-out test set. Every scikit-learn predictor, from a random forest to a neural net, uses this same fit/predict/score trio.
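The "weighted sum" can be checked by hand. The sketch below reuses the TV-spend data but, as a simplifying assumption, fits on all ten rows (so its coefficients will differ slightly from the split-based run above). It rebuilds the predictions from coef_ and intercept_ and compares them with predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same TV-spend data as in the example: spend ($k) -> sales ($k)
X = np.array([[230], [44], [17], [151], [180], [8], [57], [120], [199], [66]])
y = np.array([22, 10, 9, 18, 19, 6, 11, 16, 21, 12])

# Fitted on every row here, purely to inspect the learned formula.
model = LinearRegression().fit(X, y)

# A prediction is the dot product of features and weights, plus the intercept.
manual = X @ model.coef_ + model.intercept_
auto = model.predict(X)

print(np.allclose(manual, auto))
```

With more features the same line still works: X gains columns, coef_ gains matching entries, and the dot product sums them.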
Because train_test_split shuffles the rows, a different random_state gives slightly different coefficients. Setting random_state=42 makes the run reproducible; always pin it in examples and experiments.
- Every scikit-learn model follows the same API: fit(X, y), then predict(X), then score.
- Linear regression fits a weighted sum of features; coef_ and intercept_ hold what it learned.
- Pin random_state for reproducible splits and results.
3. Classification & the confusion matrix
When the label is a category, you classify. Logistic regression — despite its name — is the standard first classifier: it outputs a probability between 0 and 1, then thresholds it into a class.
Train a classifier
from sklearn.linear_model import LogisticRegression
# hours studied -> passed (1) or failed (0)
hours = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
clf = LogisticRegression().fit(hours, passed)
print('P(pass | 4.5 hrs):', round(clf.predict_proba([[4.5]])[0, 1], 3))
print('Predict for 7 hrs :', clf.predict([[7]])[0])
P(pass | 4.5 hrs): 0.503
Predict for 7 hrs : 1
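The "probability, then threshold" step can be made explicit. This sketch refits the same hours/passed data and shows that predict matches a cut at 0.5 on predict_proba. The stricter 0.8 cut-off is a hypothetical choice for illustration, not something scikit-learn applies by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
clf = LogisticRegression().fit(hours, passed)

# Column 1 of predict_proba is P(class 1), i.e. P(pass).
p_pass = clf.predict_proba(hours)[:, 1]

# For a binary problem, predict() is equivalent to thresholding P(pass) at 0.5.
default_labels = (p_pass > 0.5).astype(int)
print(np.array_equal(default_labels, clf.predict(hours)))

# Hypothetical stricter rule: only call it a pass when P(pass) > 0.8.
strict_labels = (p_pass > 0.8).astype(int)
print(strict_labels)
```

Raising the threshold trades recall for precision: fewer students are labelled as passing, but those who are labelled are more certain.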
Read the confusion matrix
Accuracy alone hides where a classifier fails. The confusion matrix shows the four outcomes: true/false × positive/negative.
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))
[[3 1]
 [1 5]]
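The four cells can be unpacked and turned into metrics by hand. The sketch below uses the same y_true and y_pred; the formulas are the standard definitions of precision, recall and accuracy.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]

# For binary 0/1 labels, rows are true classes and columns are predicted classes,
# so flattening the matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # of predicted positives, how many were right
recall = tp / (tp + fn)                      # of real positives, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right

print(tn, fp, fn, tp)
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```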
- Logistic regression outputs a probability via predict_proba, then thresholds to a class.
- The confusion matrix breaks results into TN, FP, FN, TP, which is far more informative than accuracy alone.
- On imbalanced data, judge with precision and recall, not accuracy.
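A small sketch of why accuracy misleads on imbalanced data. The labels are hypothetical (95 negatives, 5 positives), and the "model" simply predicts the majority class every time.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives (e.g. rare fraud).
y_true = [0] * 95 + [1] * 5

# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning, because no positives were predicted at all.
prec = precision_score(y_true, y_pred, zero_division=0)

print('Accuracy :', acc)
print('Recall   :', rec)
print('Precision:', prec)
```

Accuracy comes out at 0.95 even though the model never catches a single positive, which is exactly what recall of 0.0 exposes.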
4. Evaluating models honestly: metrics & cross-validation
Pick the metric that matches the cost of being wrong, and validate on more than one lucky split.
Classification metrics in one report
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, digits=2))
              precision    recall  f1-score   support
           0       0.75      0.75      0.75         4
           1       0.83      0.83      0.83         6
    accuracy                           0.80        10
   macro avg       0.79      0.79      0.79        10

| Metric | Question it answers | Use when |
|---|---|---|
| Precision | Of flagged positives, how many were right? | false alarms are costly |
| Recall | Of real positives, how many did we catch? | misses are costly (disease, fraud) |
| F1 | Balance of precision & recall | you need both |
| RMSE / R² | Regression error / variance explained | predicting numbers |
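For the regression row of the table, here is a minimal sketch of computing RMSE and R² on three hypothetical values. RMSE is taken as the square root of mean_squared_error, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions for three samples.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared error, in the same units as y.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
r2 = r2_score(y_true, y_pred)

print(round(rmse, 3))  # 1.291
print(round(r2, 3))    # 0.375
```

Here the squared errors are 1, 0 and 4, so MSE is 5/3; the total sum of squares around the mean of 5 is 8, giving R² = 1 - 5/8.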
Cross-validation: don't trust one split
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print('Fold scores:', scores.round(2))
print(f'Mean R2: {scores.mean():.3f} (+/- {scores.std():.3f})')
Fold scores: [0.88 0.91 0.85 0.93 0.89]
Mean R2: 0.892 (+/- 0.028)
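What cross_val_score does can be sketched by hand. The example below assumes synthetic data from make_regression (not the course dataset) and loops over an unshuffled KFold, which is what an integer cv resolves to for a regressor. Each fold trains on four parts and scores on the fifth.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model)                   # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on four folds
    # score() on a regressor returns R2, here on the held-out fifth fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(np.allclose(manual_scores, auto_scores))
```

Every row is used for validation exactly once, which is why the mean and spread are steadier than a single split.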
- Choose metrics by the cost of errors: precision (false alarms), recall (misses), F1 (both), RMSE/R² (regression).
- classification_report gives precision, recall and F1 per class at a glance.
- k-fold cross-validation rotates the validation fold for a stable mean ± spread estimate.
5. Pipelines & preprocessing (no data leakage)
Most models need scaled numbers and encoded categories. The danger is data leakage — letting test information sneak into training (e.g. scaling using the test set's mean). The Pipeline bundles preprocessing with the model so the same fitted transforms apply correctly within each fold.
A clean pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scale', StandardScaler()), # fit on train fold only
('model', LogisticRegression()),
])
# Assumes X_tr, X_te, y_tr, y_te hold a train/test split of a classification dataset (not shown)
pipe.fit(X_tr, y_tr)
print('Test accuracy:', round(pipe.score(X_te, y_te), 3))
Test accuracy: 0.95
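One way to confirm the pipeline is leak-free is to inspect the fitted scaler. This sketch assumes synthetic data from make_classification and checks that the scaler's stored mean matches the training rows only, not the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The fitted scaler stores the mean of the training rows only.
scaler = pipe.named_steps['scale']
uses_train_mean = np.allclose(scaler.mean_, X_tr.mean(axis=0))
uses_full_mean = np.allclose(scaler.mean_, X.mean(axis=0))
print(uses_train_mean, uses_full_mean)

# score() reuses those training statistics to scale the test rows before predicting.
print('Test accuracy:', round(pipe.score(X_te, y_te), 3))
```

The leaky alternative would be calling StandardScaler().fit(X) on all rows before splitting, which lets test-set statistics shape the training features.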
Mixed data types with ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
pre = ColumnTransformer([
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
full = Pipeline([('prep', pre), ('model', LogisticRegression())])
# full.fit(X_train, y_train) -> one object does everything, leak-free
- Data leakage = test information influencing training; it inflates scores and breaks in production.
- Pipeline chains preprocessing + model so transforms fit only on training data per fold.
- ColumnTransformer applies scaling to numeric and one-hot encoding to categorical columns together.
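The ColumnTransformer snippet in this section leaves the fit commented out because it has no data. Below is a runnable sketch on a tiny, made-up customer table: the column names follow the snippet, but the values and the target are hypothetical. It also shows what handle_unknown='ignore' buys you when a new city turns up at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table; values and target are invented for illustration.
X_train = pd.DataFrame({
    'age':    [25, 32, 47, 51, 38, 29],
    'income': [30_000, 42_000, 80_000, 95_000, 60_000, 35_000],
    'city':   ['Leeds', 'York', 'Leeds', 'Bath', 'York', 'Bath'],
})
y_train = [0, 0, 1, 1, 1, 0]

pre = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
full = Pipeline([('prep', pre), ('model', LogisticRegression())])
full.fit(X_train, y_train)

# 2 scaled numeric columns + 3 one-hot city columns = 5 features.
n_features = full.named_steps['prep'].transform(X_train).shape[1]
print(n_features)

# A city never seen in training is encoded as all zeros instead of raising an error.
X_new = pd.DataFrame({'age': [40], 'income': [70_000], 'city': ['Hull']})
pred = full.predict(X_new)
print(pred)
```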
6. Overfitting, underfitting & the bias–variance trade-off
The central tension in ML. An underfit model is too simple to capture the pattern (high bias). An overfit model memorises noise and fails on new data (high variance). The sweet spot generalises.
Spot it: train vs test gap
print('Train R2:', round(model.score(X_tr, y_tr), 3))
print('Test R2:', round(model.score(X_te, y_te), 3))
Train R2: 0.918
Test R2: 0.901
A small gap (here ~0.02) means the model generalises. A large gap — superb on train, poor on test — is the signature of overfitting.
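To watch the gap open up, here is a sketch on hypothetical noisy data (a sine curve plus random noise) using decision trees of increasing depth. Trees are swapped in for the linear model only because their complexity is easy to dial up; the unlimited-depth tree fits every training point, so its train score reaches 1.0 while its test score cannot.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# max_depth=None lets the tree grow until it reproduces every training point.
for depth in (1, 3, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_r2 = tree.score(X_tr, y_tr)
    test_r2 = tree.score(X_te, y_te)
    print(f'max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}, '
          f'gap={train_r2 - test_r2:.3f}')
```

Read the three rows side by side: a very shallow tree tends to score modestly on both sets (underfitting), while the unlimited tree shows the train-versus-test gap that signals overfitting.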
Regularisation: penalise complexity
from sklearn.linear_model import Ridge
# alpha controls the penalty: higher = simpler, less overfitting
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
print('Ridge test R2:', round(ridge.score(X_te, y_te), 3))
Ridge test R2: 0.904
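What "higher alpha = simpler" means in practice is that the learned weights shrink toward zero. This sketch assumes synthetic data from make_regression and prints the overall size of the coefficient vector for three alpha values chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 50 rows, 10 features, noisy target.
X, y = make_regression(n_samples=50, n_features=10, noise=20.0, random_state=0)

norms = []
for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    norm = np.linalg.norm(ridge.coef_)  # overall size of the learned weights
    norms.append(norm)
    print(f'alpha={alpha}: coefficient norm = {norm:.2f}')
```

In a real project alpha is usually tuned with cross-validation rather than picked by hand.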
- Underfitting = too simple (high bias); overfitting = memorising noise (high variance).
- A large train-vs-test performance gap is the hallmark of overfitting.
- Combat overfitting with more data, simpler models, and regularisation (Ridge/Lasso).
★ Hands-on Project — End-to-End Prediction Model
Build a complete, honestly-evaluated supervised model on a real dataset and document the workflow like a professional.
- Choose a dataset with a clear target — regression (e.g. California housing) or classification (e.g. Titanic survival, telco churn).
- Split into train/test with a fixed random_state; never touch the test set until the very end.
- Build a Pipeline with a ColumnTransformer that scales numeric columns and one-hot encodes categoricals.
- Train at least two models (e.g. linear/logistic plus a tree-based model) and compare them with 5-fold cross-validation.
- Report the right metrics: RMSE and R² for regression, or precision/recall/F1 and a confusion matrix for classification.
- Diagnose fit: compare train vs cross-val scores and state whether the model under- or over-fits.
- Apply regularisation or feature changes to improve generalisation, then evaluate ONCE on the held-out test set.
- Write a short model card: data, features, metric, performance, limitations — and commit the notebook to your portfolio.
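As a starting point, the core of the workflow above might look like the sketch below. It is a skeleton under stated assumptions, not a finished solution: it uses scikit-learn's bundled diabetes dataset as a stand-in (all numeric, so the one-hot step is omitted), and the two candidate models and their settings are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn; swap in your own data and target.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

candidates = {
    'ridge': Pipeline([('scale', StandardScaler()), ('model', Ridge(alpha=1.0))]),
    'forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# Compare the candidates with 5-fold CV on the training data only.
cv_means = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_train, y_train, cv=5, scoring='r2')
    cv_means[name] = scores.mean()
    print(f'{name}: mean CV R2 = {scores.mean():.3f} (+/- {scores.std():.3f})')

# Pick the best by CV, refit on all training rows, then touch the test set once.
best_name = max(cv_means, key=cv_means.get)
best = candidates[best_name].fit(X_train, y_train)
test_r2 = best.score(X_test, y_test)
print(f'Chosen: {best_name}, test R2 = {test_r2:.3f}')
```

For a dataset with categorical columns, replace the StandardScaler step with a ColumnTransformer as described in the pipelines section.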