🌳 Module 6

Advanced & Unsupervised Learning

⏱ 16 hours · Intermediate–Advanced · 6 topics
🎯 By the end: train decision trees and powerful ensembles (random forests, gradient boosting), read feature importance, cluster unlabelled data with K-means, reduce dimensions with PCA, and tune hyper-parameters systematically.

Linear models are honest and interpretable, but many of the algorithms that win Kaggle competitions on tabular data and power real products are tree ensembles: random forests and gradient boosting. This module levels you up: first the supervised heavyweights, then the unsupervised toolkit (clustering and dimensionality reduction) for when you have no labels at all, and finally the systematic hyper-parameter tuning that squeezes the last few points out of a model. It is the same scikit-learn API throughout, so you already know how to drive these models; now you will learn when and why to use them.

1. Decision trees: how machines make rules

A decision tree learns a flowchart of yes/no questions, splitting the data to make each group as “pure” as possible. It is the building block of the ensembles that follow, and it is wonderfully interpretable.

[Figure: example decision tree for the iris data]
petal length < 2.5?  (gini = 0.66)
  yes → Setosa: pure leaf (gini = 0)
  no  → petal width < 1.8?  (gini = 0.5)
          yes → Versicolor
          no  → Virginica
Each node asks one question; data flows down to a leaf that holds the prediction.
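The gini values in the diagram can be checked by hand. The following is a minimal sketch, assuming the standard Gini impurity formula (1 minus the sum of squared class proportions) and the iris class counts of 50 flowers per species. The helper names `gini` and `weighted_gini` are illustrative and not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Root node of the iris tree: 50 flowers of each species
root = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
# First split: all setosa go one way, everything else goes the other way
left = ["setosa"] * 50
right = ["versicolor"] * 50 + ["virginica"] * 50

print(round(gini(root), 3))                  # 0.667
print(gini(left))                            # 0.0
print(gini(right))                           # 0.5
print(round(weighted_gini(left, right), 3))  # 0.333
```

A tree builder tries many candidate questions and keeps the one with the lowest weighted impurity; here the split halves the impurity from about 0.667 to about 0.333.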
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
print('CV accuracy:', round(cross_val_score(tree, X, y, cv=5).mean(), 3))
▶ Output
CV accuracy: 0.967
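To see the interpretability claim in practice, a fitted tree can be printed as plain if/else rules. This sketch assumes scikit-learn's `export_text` helper and refits the same depth-3 tree on the full iris data; the exact thresholds printed depend on your scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)

# Render the learned flowchart as indented text rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one yes/no question, and each `class:` line is a leaf holding the prediction.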
A deep tree overfits. Left unconstrained, a tree keeps splitting until every leaf is pure, often down to a single sample per leaf, which amounts to memorising the training set. Limit it with max_depth or min_samples_leaf. This weakness is exactly what ensembles fix.
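One way to observe this is to compare training and held-out accuracy for constrained and unconstrained trees. The sketch below is an illustration only: the breast-cancer dataset, the 75/25 split and the specific limits (`max_depth=3`, `min_samples_leaf=10`) are assumptions chosen for the demo, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

results = {}
for name, kwargs in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **kwargs).fit(X_train, y_train)
    # Store (depth, training accuracy, held-out accuracy) for each setting
    results[name] = (
        tree.get_depth(),
        tree.score(X_train, y_train),
        tree.score(X_test, y_test),
    )
    depth, train_acc, test_acc = results[name]
    print(f"{name:20s} depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

The unconstrained tree should fit the training split perfectly; the gap between its training and held-out scores is the overfitting that the constraints are meant to narrow.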
Key points
  • A decision tree splits data with yes/no questions to make groups purer (lower Gini/entropy).
  • Trees are interpretable but overfit easily — constrain with max_depth/min_samples_leaf.
  • A single tree is the building block of the powerful ensembles that follow.

2. Ensembles: random forests & gradient boosting

A single tree is a weak, unstable model; many trees combined are among the strongest methods for tabular data. The two main strategies are bagging (train many independent trees on bootstrap samples of the rows, with a random subset of features at each split, and average them: a random forest) and boosting (build trees in sequence, each one correcting the errors of the ensemble so far: gradient boosting).

[Figure: tree 1, tree 2, tree 3, ... tree N → Vote / Average]
A random forest averages many decorrelated trees — far more accurate and stable than one.
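Bagging is simple enough to sketch by hand. The code below is a simplified illustration, not how `RandomForestClassifier` is implemented internally: it draws bootstrap samples with NumPy, fits one tree per sample and takes a majority vote. The choice of 25 trees, `max_features="sqrt"` and the 75/25 split are assumptions made for the demo.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for t in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" looks at a random feature subset at each split,
    # which helps decorrelate the trees
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X_train[idx], y_train[idx])
    votes[t] = tree.predict(X_test)

# Majority vote over the trees (labels are 0/1, so the mean is the vote share)
majority = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (votes[0] == y_test).mean()
ensemble_acc = (majority == y_test).mean()
print(f"first tree alone: {single_acc:.3f}")
print(f"vote of {n_trees} trees: {ensemble_acc:.3f}")
```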
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
gb = GradientBoostingClassifier(random_state=42)

print('Random Forest    :', round(cross_val_score(rf, X, y, cv=5).mean(), 3))
print('Gradient Boosting:', round(cross_val_score(gb, X, y, cv=5).mean(), 3))
▶ Output
Random Forest    : 0.958
Gradient Boosting: 0.965
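The "each tree fixes the last one's mistakes" idea behind boosting can also be sketched in a few lines. This is a simplified illustration for regression with squared error, where the mistakes are just the residuals; it is not the algorithm inside `GradientBoostingClassifier`. The synthetic sine data, the learning rate of 0.1, the depth-2 trees and the 50 rounds are all assumptions made for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # The current mistakes of the ensemble
    residual = y - prediction
    # Fit a small tree to those mistakes
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Take a small step towards correcting them
    prediction += learning_rate * small_tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {mse_history[0]:.4f}")
print(f"training MSE after 50 rounds: {mse_history[-1]:.4f}")
```

The learning rate trades speed for stability: smaller steps need more trees but usually generalise better.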
For tabular data, gradient boosting is the default to beat. Libraries like XGBoost, LightGBM and CatBoost are tuned, faster implementations and dominate real-world structured-data problems. Reach for them before deep learning on spreadsheet-shaped data.
Key points
  • Bagging (random forest) averages many independent trees; boosting builds trees that fix prior errors.
  • Ensembles are far more accurate and stable than a single tree.
  • Gradient boosting (XGBoost/LightGBM/CatBoost) is the go-to for tabular data — often beats deep learning.

3. Feature importance & engineering for ML

Models are only as good as their features. Tree ensembles also tell you which features mattered — invaluable for trust and for trimming noise.

Which features drove the model?

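import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
# Same forest settings as in topic 2, fitted here on the full dataset
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(data.data, data.target)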

order = np.argsort(rf.feature_importances_)[::-1][:5]
for i in order:
    print(f'{data.feature_names[i]:24s} {rf.feature_importances_[i]:.3f}')
▶ Output
worst area               0.139
worst concave points     0.132
mean concave points      0.106
worst radius             0.079
worst perimeter          0.071
Use permutation importance for honesty. Built-in tree importance can favour high-cardinality features. sklearn.inspection.permutation_importance measures the real drop in performance when a feature is shuffled — a more trustworthy ranking.
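A minimal sketch of permutation importance, assuming a 75/25 train/test split and 10 shuffles per feature (both are choices made for this example). Scoring on held-out data, rather than on the rows the forest was trained on, is what makes the ranking more trustworthy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out rows and record the accuracy drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

order = result.importances_mean.argsort()[::-1][:5]
for i in order:
    print(f"{data.feature_names[i]:24s} "
          f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

When features are strongly correlated, as many are in this dataset, shuffling one of them may barely hurt the model because its neighbours carry the same signal, so small values should be read with care.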

Engineering better features

  • Interactions: combine columns (price × quantity = revenue).
  • Ratios & rates: often more predictive than raw counts.
  • Date parts: hour, day-of-week, is_holiday from a timestamp.
  • Binning & encoding: turn messy continuous/categorical fields into signal.
  • Domain features: the biggest wins come from subject knowledge, not algorithms.
Feature engineering beats model-hunting. A thoughtful new feature usually helps more than swapping algorithms. Spend your time understanding the problem domain — that is where competitions and real projects are won.
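The feature ideas listed above can be sketched with pandas. The `orders` table and all of its column names are hypothetical, invented purely for illustration, and the price bands are arbitrary cut points.

```python
import pandas as pd

# Hypothetical order data, made up for this example
orders = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-05 09:30", "2024-01-06 18:45", "2024-01-08 23:10"]
    ),
    "price": [20.0, 5.0, 12.5],
    "quantity": [3, 10, 4],
    "visits": [10, 40, 8],
    "purchases": [2, 4, 4],
})

# Interaction: price x quantity = revenue
orders["revenue"] = orders["price"] * orders["quantity"]

# Ratio: purchases per visit instead of raw counts
orders["conversion_rate"] = orders["purchases"] / orders["visits"]

# Date parts pulled out of the timestamp (Monday = 0)
orders["hour"] = orders["timestamp"].dt.hour
orders["day_of_week"] = orders["timestamp"].dt.dayofweek
orders["is_weekend"] = orders["day_of_week"] >= 5

# Binning: turn a continuous price into ordered bands
orders["price_band"] = pd.cut(
    orders["price"], bins=[0, 10, 15, float("inf")], labels=["low", "mid", "high"]
)

print(orders.drop(columns="timestamp"))
```

A categorical column such as `price_band` still needs encoding (for example one-hot) before most scikit-learn models can use it.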
Key points
  • Tree ensembles expose feature_importances_; prefer permutation importance for a fair ranking.
  • Engineer interactions, ratios, date parts and domain features to inject signal.
  • Good features usually beat fancier models — invest in understanding the domain.

4. Unsupervised learning: K-means clustering

With no labels, you find structure. K-means groups points into k clusters by repeatedly assigning each point to the nearest cluster centre, then moving the centres to the mean of their members.
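The assign-then-move loop is short enough to write out in NumPy. This is a teaching sketch under simplifying assumptions: the synthetic blobs are made up, and the starting centres are hand-picked (one point from each blob) so the result is deterministic. scikit-learn's `KMeans` instead uses smarter k-means++ initialisation and several restarts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each
true_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centres])

def kmeans(X, init_centres, n_iter=20):
    centres = init_centres.copy()
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centre to the mean of its members
        # (an empty cluster keeps its old centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(len(centres))
        ])
        # Stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Hand-picked start: rows 0, 50 and 100 each come from a different blob
labels, centres = kmeans(X, X[[0, 50, 100]])
print("Cluster sizes:", np.bincount(labels))
print("Centres:\n", centres.round(2))
```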

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print('Cluster sizes:', np.bincount(km.labels_))
print('Inertia      :', round(km.inertia_, 1))
print('Silhouette   :', round(silhouette_score(X, km.labels_), 3))
▶ Output
Cluster sizes: [100 100 100]
Inertia      : 612.4
Silhouette   : 0.842
K-means assigns each point to its nearest centroid (the crosses); silhouette near 1 means tight, well-separated clusters.

How many clusters? The elbow & silhouette

# Try k = 1..5 and watch inertia drop, looking for the 'elbow'
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f'k={k}  inertia={km.inertia_:8.1f}')
▶ Output
k=1  inertia= 5601.2
k=2  inertia= 1788.5
k=3  inertia=  612.4
k=4  inertia=  535.1
k=5  inertia=  470.8

Inertia plunges to k=3, then flattens — the “elbow” at 3 matches the true number of groups.
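The silhouette score gives a second opinion on k. A minimal sketch on the same blobs, assuming k from 2 to 5 as the search range (silhouette is undefined for a single cluster, so the loop starts at 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}  silhouette={scores[k]:.3f}")

# Unlike inertia, silhouette peaks at a good k instead of always improving
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

Inertia always falls as k grows, so it needs an eyeballed elbow; the silhouette score has a maximum, which makes the choice easier to automate.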

Always scale before K-means. It uses Euclidean distance, so a column measured in thousands will dominate one measured in fractions. Standardise features first, and remember K-means assumes roughly round, equal-size clusters.
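The effect of scaling is easy to demonstrate on made-up data. In this sketch, which uses an invented two-column dataset, the informative column lives on a 0 to 1 scale while a pure-noise column is measured in the tens of thousands. Wrapping `StandardScaler` and `KMeans` in one pipeline is one convenient way to make sure scaling is never forgotten.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
true_group = np.repeat([0, 1], n)

# Informative column on a small scale: groups sit near 0.0 and 1.0
signal = np.where(true_group == 0, 0.0, 1.0) + rng.normal(scale=0.05, size=2 * n)
# Uninformative column measured in the tens of thousands
noise = rng.normal(loc=50_000, scale=10_000, size=2 * n)
X = np.column_stack([signal, noise])

# Without scaling, distances are dominated by the noisy column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# With scaling, both columns contribute on equal terms
scaled_km = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=42)
)
scaled_labels = scaled_km.fit_predict(X)

# Adjusted Rand index: 1.0 means the true groups were recovered exactly
ari_raw = adjusted_rand_score(true_group, raw_labels)
ari_scaled = adjusted_rand_score(true_group, scaled_labels)
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling   : {ari_scaled:.3f}")
```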
Key points
  • K-means iteratively assigns points to the nearest centroid and recomputes centroids.
  • Pick k with the elbow (inertia) and silhouette score; scale features first.
  • K-means assumes round, similarly-sized clusters and depends on initialisation (use n_init).

5. Dimensionality reduction with PCA

High-dimensional data is hard to visualise and can hurt models (the “curse of dimensionality”). Principal Component Analysis (PCA) finds new axes that capture the most variance, letting you compress many features into a few while keeping most of the information.

[Figure: scatter plot of the data with the two principal axes, PC1 and PC2]
PC1 is the direction of greatest variance; PC2 is perpendicular. Most information lives along PC1.
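Under the hood, PCA amounts to centring the data and taking a singular value decomposition. The NumPy sketch below uses made-up, correlated 2-D data to show where the components and the explained-variance ratios come from; it illustrates the idea and is not a replacement for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data where the second column mostly follows the first
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.2, size=500)])

# 1. Centre each column
Xc = X - X.mean(axis=0)

# 2. SVD: the rows of Vt are the principal directions (PC1, PC2, ...)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Variance captured along each direction, and its share of the total
explained_variance = S ** 2 / (len(X) - 1)
ratio = explained_variance / explained_variance.sum()

# 4. Coordinates of every point on the new axes
scores = Xc @ Vt.T

print("PC1 direction:", Vt[0].round(3))
print("PC2 direction:", Vt[1].round(3))
print("Explained variance ratio:", ratio.round(3))
```

The two directions are perpendicular, and the variance of the projected data along each one matches `explained_variance`.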
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data           # 30 features
Xs = StandardScaler().fit_transform(X)  # PCA needs scaled data

pca = PCA(n_components=2).fit(Xs)
print('Explained variance:', pca.explained_variance_ratio_.round(3))
print('Cumulative        :', round(pca.explained_variance_ratio_.sum(), 3))
▶ Output
Explained variance: [0.443 0.19 ]
Cumulative        : 0.632

Two components capture 63% of the variance in 30 columns. That is enough to plot the whole dataset on a 2-D scatter; to speed up a downstream model with little loss you would normally keep more components, enough to retain most of the variance.
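Instead of fixing the number of components, you can ask PCA for a share of the variance. In scikit-learn, passing a float between 0 and 1 as `n_components` keeps the smallest number of components that reaches that share. The 95% target below is an assumption chosen for illustration, not a universal rule.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
Xs = StandardScaler().fit_transform(X)

# Keep as many components as needed to retain 95% of the variance
pca = PCA(n_components=0.95).fit(Xs)
X_reduced = pca.transform(Xs)

print("Components kept  :", pca.n_components_)
print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
print("Shape            :", X.shape, "->", X_reduced.shape)
```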

PCA is for compression and visualisation, not interpretation. The components are blends of the original features, so they are hard to name. PCA is also linear, so a 2-D projection can blur clusters together; for a 2-D map where clusters stand out clearly, try t-SNE or UMAP instead.
Key points
  • PCA projects data onto new axes (components) ordered by how much variance they capture.
  • Always scale features first; explained_variance_ratio_ shows how much information each component keeps.
  • PCA is for compression/visualisation; for 2-D cluster maps, t-SNE or UMAP often look clearer.

6. Hyper-parameter tuning

Every model has hyper-parameters — settings you choose, not ones it learns (tree depth, number of estimators, learning rate). Tuning them systematically, with cross-validation, is how you get the most from a model without fooling yourself.

Grid search with cross-validation

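from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # same data as topic 2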

params = {
    'n_estimators': [100, 200],
    'max_depth':    [None, 5, 10],
    'min_samples_leaf': [1, 4],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    params, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

print('Best params:', grid.best_params_)
print('Best CV acc:', round(grid.best_score_, 3))
▶ Output
Best params: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV acc: 0.963

Grid vs random vs Bayesian

Method               How it searches               Best for
GridSearchCV         every combination             few parameters
RandomizedSearchCV   random samples of the space   many parameters
Bayesian (Optuna)    learns where to look next     expensive models
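For comparison with the grid search, here is a hedged sketch of `RandomizedSearchCV`. The parameter ranges and the budget of 5 sampled configurations are assumptions picked to keep the example fast; in practice you would widen both.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Distributions (or lists) to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(50, 151),     # integers from 50 to 150
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 9),    # integers from 1 to 8
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,            # number of random configurations to try
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV acc:", round(search.best_score_, 3))
```

The cost is controlled by `n_iter` rather than by the size of the space, which is why random search scales better when there are many hyper-parameters.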
Never tune on the test set. Grid search uses cross-validation within the training data. Keep your final test set untouched until the very end — otherwise the score you report is optimistic and will not hold in production.
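In code, that discipline looks roughly like the sketch below: split once, tune with cross-validation on the training portion only, then score the held-out portion a single time. The 80/20 split and the small grid are assumptions made to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. Set the test set aside before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Tune with cross-validation on the training data only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [None, 5], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# 3. One final evaluation; GridSearchCV refits the best model on all of X_train
test_acc = grid.score(X_test, y_test)
print("Best params:", grid.best_params_)
print("CV accuracy  :", round(grid.best_score_, 3))
print("Test accuracy:", round(test_acc, 3))
```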
Key points
  • Hyper-parameters are set by you, not learned; tune them with cross-validation.
  • GridSearchCV tries every combination; RandomizedSearchCV samples a large space efficiently.
  • Tune using CV on the training data only — keep the test set for one final, honest evaluation.

★ Hands-on Project — Beat the Baseline with an Ensemble

Take a dataset, establish a simple baseline, then systematically improve it with ensembles, feature engineering and tuning — and add an unsupervised exploration.

  1. Load a tabular dataset and set a baseline with logistic/linear regression inside a Pipeline (record the cross-val score).
  2. Train a random forest and a gradient-boosting model; compare them to the baseline with 5-fold cross-validation.
  3. Plot feature importance (and permutation importance) and remove or combine the weakest features.
  4. Engineer at least two new features from domain knowledge and re-measure the improvement.
  5. Run GridSearchCV (or RandomizedSearchCV) on your best model and report the chosen hyper-parameters.
  6. Unsupervised side-quest: scale the features, run K-means, choose k with the elbow/silhouette, and describe the clusters.
  7. Use PCA to project the data to 2-D and scatter-plot it coloured by cluster (or by the label).
  8. Evaluate your final tuned model ONCE on the held-out test set and write up what helped most, then commit to your portfolio.
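As a starting point for steps 1 and 2, the sketch below sets up a scaled logistic-regression baseline and a random-forest challenger. It uses the breast-cancer data only as a stand-in for your own dataset, and the 80/20 split and model settings are assumptions you should adapt.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; replace with your own tabular data
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set now and leave it alone until step 8
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1: baseline, with scaling inside the pipeline so each CV fold
# is scaled using its own training rows only
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5).mean()

# Step 2: a first ensemble to compare against the baseline
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest_cv = cross_val_score(forest, X_train, y_train, cv=5).mean()

print(f"Baseline (logistic regression): {baseline_cv:.3f}")
print(f"Random forest                 : {forest_cv:.3f}")
```

Record both numbers before moving on; every later step in the project is judged against them.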

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
