Linear models are honest and interpretable, but the algorithms that win Kaggle competitions and power real products are tree ensembles — random forests and gradient boosting. This module levels you up: first the supervised heavyweights, then the unsupervised toolkit (clustering and dimensionality reduction) for when you have no labels at all, and finally the systematic hyper-parameter tuning that squeezes the last few points out of any model. Same scikit-learn API throughout — you already know how to drive these; now you will know when and why.
1. Decision trees: how machines make rules
A decision tree learns a flowchart of yes/no questions, splitting the data to make each group as “pure” as possible. It is the building block of the ensembles that follow, and it is wonderfully interpretable.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
print('CV accuracy:', round(cross_val_score(tree, X, y, cv=5).mean(), 3))
CV accuracy: 0.967
A single tree overfits easily unless you constrain it with max_depth or min_samples_leaf. This weakness is exactly what ensembles fix.
- A decision tree splits data with yes/no questions to make groups purer (lower Gini/entropy).
- Trees are interpretable but overfit easily; constrain them with max_depth/min_samples_leaf.
- A single tree is the building block of the powerful ensembles that follow.
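To make "purity" concrete, here is a minimal sketch of Gini impurity, the default split criterion of DecisionTreeClassifier. The helper name gini and the toy labels are assumptions made up for illustration; they are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = perfectly pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy parent node: 5 samples of class 0 and 5 of class 1
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# A hypothetical yes/no question sends these samples left and right
left = np.array([0, 0, 0, 0, 1])
right = np.array([0, 1, 1, 1, 1])

# A split is scored by the size-weighted impurity of its children
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print('Parent impurity  :', round(gini(parent), 3))
print('Weighted children:', round(weighted, 3))
```

The tree prefers the question whose children have the lowest weighted impurity, so a split that lowers impurity relative to the parent is a useful one.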
2. Ensembles: random forests & gradient boosting
One tree is weak; many trees together are state-of-the-art. The two great strategies are bagging (build many independent trees on bootstrap samples of the data and average them → random forest) and boosting (build trees in sequence, each fixing the last one's mistakes → gradient boosting).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
gb = GradientBoostingClassifier(random_state=42)
print('Random Forest :', round(cross_val_score(rf, X, y, cv=5).mean(), 3))
print('Gradient Boosting:', round(cross_val_score(gb, X, y, cv=5).mean(), 3))
Random Forest : 0.958
Gradient Boosting: 0.965
- Bagging (random forest) averages many independent trees; boosting builds trees that fix prior errors.
- Ensembles are far more accurate and stable than a single tree.
- Gradient boosting (XGBoost/LightGBM/CatBoost) is the go-to for tabular data — often beats deep learning.
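Bagging is easier to trust once you have built it by hand. The sketch below trains trees on bootstrap samples and combines them with a majority vote. The tree count, the train/test split and the hard vote are illustrative assumptions; scikit-learn's RandomForestClassifier averages predicted probabilities instead of counting votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice; odd, so a 0/1 vote cannot tie
votes = []
for _ in range(n_trees):
    # Bootstrap: sample rows with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(
        max_features='sqrt',                      # random feature subset per split
        random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# Majority vote across trees (labels are 0/1, so the mean vote works)
votes = np.array(votes)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = np.mean(votes[0] == y_test)
bagged_acc = np.mean(bagged_pred == y_test)
print('First tree alone:', round(single_acc, 3))
print('Bagged vote     :', round(bagged_acc, 3))
```

Each tree sees a slightly different dataset, so their individual errors tend to differ, and the vote averages much of that noise away.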
3. Feature importance & engineering for ML
Models are only as good as their features. Tree ensembles also tell you which features mattered — invaluable for trust and for trimming noise.
Which features drove the model?
import numpy as np
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
rf.fit(data.data, data.target)  # rf is the RandomForestClassifier created in section 2
order = np.argsort(rf.feature_importances_)[::-1][:5]
for i in order:
    print(f'{data.feature_names[i]:24s} {rf.feature_importances_[i]:.3f}')
worst area               0.139
worst concave points     0.132
mean concave points      0.106
worst radius             0.079
worst perimeter          0.071
sklearn.inspection.permutation_importance measures the real drop in performance when a feature is shuffled, which gives a more trustworthy ranking.
Engineering better features
- Interactions: combine columns (price × quantity = revenue).
- Ratios & rates: often more predictive than raw counts.
- Date parts: hour, day-of-week, is_holiday from a timestamp.
- Binning & encoding: turn messy continuous/categorical fields into signal.
- Domain features: the biggest wins come from subject knowledge, not algorithms.
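The ideas in this list translate into a few lines of pandas. This is a minimal sketch on a made-up orders table: every column name, value and bin edge is hypothetical and chosen only to illustrate each technique.

```python
import pandas as pd

# Hypothetical order data, invented purely for illustration
orders = pd.DataFrame({
    'price': [10.0, 25.0, 4.0],
    'quantity': [3, 1, 10],
    'visits': [20, 5, 40],
    'ordered_at': pd.to_datetime(
        ['2024-01-05 09:30', '2024-01-06 18:00', '2024-01-08 23:15']),
})

# Interaction: price x quantity = revenue
orders['revenue'] = orders['price'] * orders['quantity']
# Ratio: items bought per site visit
orders['items_per_visit'] = orders['quantity'] / orders['visits']
# Date parts pulled from the timestamp
orders['hour'] = orders['ordered_at'].dt.hour
orders['day_of_week'] = orders['ordered_at'].dt.dayofweek  # Monday = 0
orders['is_weekend'] = orders['day_of_week'] >= 5
# Binning: turn a continuous price into an ordered category
orders['price_band'] = pd.cut(
    orders['price'], bins=[0, 5, 20, 100], labels=['low', 'mid', 'high'])

print(orders.drop(columns='ordered_at'))
```

An is_holiday flag would follow the same pattern, but it needs a holiday calendar for your region, so it is left out here.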
- Tree ensembles expose feature_importances_; prefer permutation importance for a fair ranking.
- Engineer interactions, ratios, date parts and domain features to inject signal.
- Good features usually beat fancier models — invest in understanding the domain.
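Permutation importance takes only a few more lines than feature_importances_. The sketch below scores it on a held-out split; the split, n_repeats=5 and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42, stratify=data.target)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the accuracy drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42)

top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f'{data.feature_names[i]:24s} '
          f'{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}')
```

When features are strongly correlated, as many are in this dataset, shuffling one of them may barely hurt the score because the model can lean on its neighbours. Read low values with that in mind.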
4. Unsupervised learning: K-means clustering
With no labels, you find structure. K-means groups points into k clusters by repeatedly assigning each point to the nearest cluster centre, then moving the centres to the mean of their members.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print('Cluster sizes:', np.bincount(km.labels_))
print('Inertia :', round(km.inertia_, 1))
print('Silhouette :', round(silhouette_score(X, km.labels_), 3))
Cluster sizes: [100 100 100]
Inertia : 612.4
Silhouette : 0.842
How many clusters? The elbow & silhouette
# Try k = 1..5 and watch inertia drop, looking for the 'elbow'
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f'k={k} inertia={km.inertia_:8.1f}')
k=1 inertia=  5601.2
k=2 inertia=  1788.5
k=3 inertia=   612.4
k=4 inertia=   535.1
k=5 inertia=   470.8
Inertia plunges to k=3, then flattens — the “elbow” at 3 matches the true number of groups.
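The silhouette score gives a second opinion on k, and unlike inertia it does not keep improving automatically as k grows. A minimal sketch on the same kind of blob data follows; the range of k values tried is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f'k={k} silhouette={scores[k]:.3f}')

# Higher is better: points sit close to their own cluster, far from the others
best_k = max(scores, key=scores.get)
print('Best k by silhouette:', best_k)
```

If the elbow and the silhouette peak agree, you can be more confident in the choice; if they disagree, inspect the clusters themselves.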
- K-means iteratively assigns points to the nearest centroid and recomputes centroids.
- Pick k with the elbow (inertia) and silhouette score; scale features first.
- K-means assumes round, similarly-sized clusters and depends on initialisation (use n_init).
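The assign-then-update loop is short enough to write yourself in NumPy. This is a bare-bones sketch for intuition, not how scikit-learn implements it: it starts from random data points rather than k-means++ and runs a single initialisation, so it can land in a worse solution than KMeans with n_init=10.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
k = 3
rng = np.random.default_rng(0)

# Start from k distinct data points chosen at random
centres = X[rng.choice(len(X), size=k, replace=False)]

for iteration in range(100):
    # Assignment step: distance from every point to every centre
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centre to the mean of its members
    # (an empty cluster keeps its old centre)
    new_centres = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
        for j in range(k)
    ])
    if np.allclose(new_centres, centres):
        break  # centres stopped moving: converged
    centres = new_centres

# Inertia: total squared distance from each point to its assigned centre
inertia = np.sum((X - centres[labels]) ** 2)
print('Iterations run:', iteration + 1)
print('Cluster sizes :', np.bincount(labels, minlength=k))
print('Inertia       :', round(inertia, 1))
```

Changing the seed in default_rng changes the starting centres, which is a quick way to see why initialisation matters.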
5. Dimensionality reduction with PCA
High-dimensional data is hard to visualise and can hurt models (the “curse of dimensionality”). Principal Component Analysis (PCA) finds new axes that capture the most variance, letting you compress many features into a few while keeping most of the information.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
X = load_breast_cancer().data # 30 features
Xs = StandardScaler().fit_transform(X) # PCA needs scaled data
pca = PCA(n_components=2).fit(Xs)
print('Explained variance:', pca.explained_variance_ratio_.round(3))
print('Cumulative :', round(pca.explained_variance_ratio_.sum(), 3))
Explained variance: [0.443 0.19 ]
Cumulative : 0.633
Two components capture 63% of the variance in 30 columns — enough to plot the whole dataset on a 2-D scatter and often enough to speed up a downstream model with little loss.
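Two components are convenient for plotting, but for compression you usually pick the number of components from a variance target. The sketch below does it two ways; the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
Xs = StandardScaler().fit_transform(X)

# Option 1: fit every component and read off the cumulative explained variance
full = PCA().fit(Xs)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print('Components needed for 95%:', n_needed)

# Option 2: a float n_components tells PCA to keep just enough components
pca95 = PCA(n_components=0.95).fit(Xs)
print('Chosen by PCA(0.95)      :', pca95.n_components_)
```

Plotting cumulative against the component index gives the familiar curve that shows where extra components stop paying for themselves.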
- PCA projects data onto new axes (components) ordered by how much variance they capture.
- Always scale features first; explained_variance_ratio_ shows how much information each component keeps.
- PCA is for compression/visualisation; for 2-D cluster maps, t-SNE or UMAP often look clearer.
6. Hyper-parameter tuning
Every model has hyper-parameters — settings you choose, not ones it learns (tree depth, number of estimators, learning rate). Tuning them systematically, with cross-validation, is how you get the most from a model without fooling yourself.
Grid search with cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
params = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 4],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    params, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print('Best params:', grid.best_params_)
print('Best CV acc:', round(grid.best_score_, 3))
Best params: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV acc: 0.963
Grid vs random vs Bayesian
| Method | How it searches | Best for |
|---|---|---|
| GridSearchCV | every combination | few parameters |
| RandomizedSearchCV | random samples of the space | many parameters |
| Bayesian (Optuna) | learns where to look next | expensive models |
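For comparison with the grid search, here is a minimal RandomizedSearchCV sketch. The parameter ranges and the budget of 6 trials are assumptions picked to keep the example quick, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Distributions instead of fixed lists: each trial draws one value per parameter
param_distributions = {
    'n_estimators': randint(50, 151),     # integers 50..150
    'max_depth': [None, 5, 10, 20],       # lists are sampled uniformly
    'min_samples_leaf': randint(1, 9),    # integers 1..8
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=6,            # number of sampled combinations (illustrative budget)
    cv=5,
    scoring='accuracy',
    random_state=42,     # makes the sampled combinations reproducible
)
search.fit(X, y)
print('Best params:', search.best_params_)
print('Best CV acc:', round(search.best_score_, 3))
```

The cost is fixed by n_iter no matter how many parameters you add, which is why random search scales better than a grid as the space grows.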
- Hyper-parameters are set by you, not learned; tune them with cross-validation.
- GridSearchCV tries every combination; RandomizedSearchCV samples a large space efficiently.
- Tune using CV on the training data only; keep the test set for one final, honest evaluation.
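The "test set once" rule looks like this in practice. It is a sketch with a deliberately small grid; the 80/20 split and the parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Set the test set aside before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    {'max_depth': [None, 5], 'min_samples_leaf': [1, 4]},
    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)  # cross-validation sees training data only

# With the default refit=True, the best model is refitted on all of X_train
test_acc = grid.score(X_test, y_test)
print('Best CV acc (train):', round(grid.best_score_, 3))
print('Test accuracy      :', round(test_acc, 3))
```

If you go back and retune after looking at the test accuracy, the test set has effectively become part of training and the final number is no longer an honest estimate.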
★ Hands-on Project — Beat the Baseline with an Ensemble
Take a dataset, establish a simple baseline, then systematically improve it with ensembles, feature engineering and tuning — and add an unsupervised exploration.
- Load a tabular dataset and set a baseline with logistic/linear regression inside a Pipeline (record the cross-val score).
- Train a random forest and a gradient-boosting model; compare them to the baseline with 5-fold cross-validation.
- Plot feature importance (and permutation importance) and remove or combine the weakest features.
- Engineer at least two new features from domain knowledge and re-measure the improvement.
- Run GridSearchCV (or RandomizedSearchCV) on your best model and report the chosen hyper-parameters.
- Unsupervised side-quest: scale the features, run K-means, choose k with the elbow/silhouette, and describe the clusters.
- Use PCA to project the data to 2-D and scatter-plot it coloured by cluster (or by the label).
- Evaluate your final tuned model ONCE on the held-out test set and write up what helped most, then commit to your portfolio.
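As a possible starting point for the first step, here is a baseline sketch. It uses the breast cancer dataset as a stand-in for whatever tabular data you choose, and max_iter=1000 is an assumption to help the solver converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; swap in your own tabular data
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_scores = cross_val_score(baseline, X, y, cv=5)
print('Baseline CV accuracy:', round(baseline_scores.mean(), 3))
```

Record this number first; every later model in the project is judged by how much it improves on it under the same 5-fold setup.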