Unsupervised Learning & PCA
Clustering
k-means Clustering
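kmeans() partitions data into k clusters by minimising the total within-cluster sum of squares (WCSS). The algorithm starts from random centroids, so set a seed and use nstart to try several random starts. Scale the variables first when they are measured on different scales. To choose k, use the elbow method: plot total WCSS against k and look for the bend where adding more clusters stops reducing WCSS by much.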
set.seed(42)
km <- kmeans(scale(iris[,-5]), centers=3, nstart=25)
km$cluster; km$centers; km$tot.withinss
# Elbow plot to choose k:
wss <- sapply(1:10, function(k) kmeans(scale(iris[,-5]),k,nstart=25)$tot.withinss)
plot(1:10, wss, type='b', xlab='k', ylab='WSS')
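To make the WCSS objective concrete, the sketch below recomputes it by hand and compares it with the value kmeans() reports. The names x, km_check, centroid_of_point and wcss_manual are illustrative only and not part of the original example.

```r
# Recompute total within-cluster sum of squares (WCSS) by hand
set.seed(42)
x <- scale(iris[, 1:4])                       # numeric columns only, standardised
km_check <- kmeans(x, centers = 3, nstart = 25)

# For every observation, look up the centroid of its assigned cluster
centroid_of_point <- km_check$centers[km_check$cluster, , drop = FALSE]

# Sum of squared distances from each point to its own centroid
wcss_manual <- sum((x - centroid_of_point)^2)

# Should agree with the value reported by kmeans()
all.equal(wcss_manual, km_check$tot.withinss)
```

The elbow plot is this same quantity evaluated for each candidate k.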
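# k-means on iris (3 clusters, ignoring Species label); unscaled here, unlike the example above
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers # cluster centroids (cluster numbering is arbitrary; the species names in the table are the closest match, not part of the output)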
| Cluster | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| 1 (setosa) | 5.01 | 3.43 | 1.46 | 0.25 |
| 2 (versicolor) | 5.90 | 2.75 | 4.39 | 1.43 |
| 3 (virginica) | 6.85 | 3.07 | 5.74 | 2.07 |
# WSS elbow plot — choosing optimal k
wss <- sapply(1:8, function(k) {
kmeans(iris[,1:4], centers=k, nstart=25)$tot.withinss
})
plot(1:8, wss, type="b", pch=19, col="#1d4ed8",
xlab="Number of Clusters (k)", ylab="Total WSS",
main="Elbow Method — Optimal k for iris")
# Cluster scatter: Petal.Length vs Petal.Width, coloured by k-means cluster
library(ggplot2)
iris_km <- transform(iris, cluster = factor(km$cluster)) # work on a copy so the built-in iris keeps its 5 columns
ggplot(iris_km, aes(Petal.Length, Petal.Width, color=cluster, shape=Species)) +
geom_point(size=2.5, alpha=0.8) +
labs(title="k-means Clusters vs True Species (iris)",
x="Petal Length", y="Petal Width") +
theme_minimal()
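The scatter plot compares clusters and species visually; a cross-tabulation gives the same comparison as counts. This is a minimal sketch that refits the model so it runs on its own. The names km_tab, tab and purity are illustrative, and purity here is just the share of observations that fall in the dominant species of their cluster.

```r
# Cross-tabulate k-means clusters against the true species labels
set.seed(42)
km_tab <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

tab <- table(Cluster = km_tab$cluster, Species = iris$Species)
tab

# Share of observations that belong to their cluster's dominant species
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

Cluster numbers carry no meaning of their own, so read the table row by row rather than expecting cluster 1 to line up with the first species.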
Dimensionality Reduction
PCA — Principal Component Analysis
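prcomp() performs PCA. Principal components (PCs) are ordered by the variance they explain, so the first few often capture most of it. Set scale. = TRUE when the variables are on different scales. Use PCA for visualisation and for feature reduction before modelling.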
pca <- prcomp(iris[, 1:4], scale. = TRUE) # numeric columns only
summary(pca) # proportion of variance explained
biplot(pca) # loadings and scores
library(factoextra) # install.packages("factoextra") if it is not installed
fviz_eig(pca) # scree plot
fviz_pca_ind(pca, habillage=iris$Species, addEllipses=TRUE)
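The proportions that summary() prints can also be computed directly from the component standard deviations, which is handy when picking the number of components in code. A minimal sketch follows. The 95% cut-off and the names pca_var, var_explained, cum_var and n_keep are assumptions for illustration.

```r
# Proportion of variance explained, computed from prcomp() output
pca_var <- prcomp(iris[, 1:4], scale. = TRUE)

var_explained <- pca_var$sdev^2 / sum(pca_var$sdev^2)  # per-component share
cum_var <- cumsum(var_explained)                       # running total

round(var_explained, 4)
round(cum_var, 4)

# Smallest number of components reaching the (assumed) 95% threshold
n_keep <- which(cum_var >= 0.95)[1]
n_keep
```

The scores for the retained components are then pca_var$x[, 1:n_keep].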
# PCA on iris — reduce 4 dimensions to 2
pca <- prcomp(iris[,1:4], scale. = TRUE)
summary(pca)
Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.7084 0.9560 0.3831 0.1439
Proportion of Variance 0.7296 0.2285 0.0367 0.0052
Cumulative Proportion  0.7296 0.9581 0.9948 1.0000
# PCA score plot: PC1 explains 73%, PC2 explains 23%, so about 96% of total variance is kept in 2D
pca_df <- data.frame(pca$x[,1:2], Species=iris$Species)
ggplot(pca_df, aes(PC1, PC2, color=Species)) +
geom_point(size=2.5, alpha=0.8) +
labs(title="PCA — iris Dataset (96% variance in 2 components)",
x="PC1 (73.0%)", y="PC2 (22.9%)") +
theme_minimal()
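For readers who want to see what prcomp() is computing, the sketch below reproduces the scaled PCA from an eigendecomposition of the correlation matrix. This illustrates the equivalence and is not how prcomp() is implemented internally (it uses a singular value decomposition). The names x_num, pca_ref and eig are illustrative.

```r
# PCA with scale. = TRUE matches an eigendecomposition of the correlation matrix
x_num <- iris[, 1:4]
pca_ref <- prcomp(x_num, scale. = TRUE)

eig <- eigen(cor(x_num))  # eigenvalues are returned in decreasing order

# Component standard deviations are the square roots of the eigenvalues
all.equal(sqrt(eig$values), pca_ref$sdev)

# Loadings match the eigenvectors up to the sign of each column
all.equal(abs(eig$vectors), abs(unname(pca_ref$rotation)))
```

The sign of a component is arbitrary, so the comparison uses absolute values.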