Unsupervised Learning & PCA

Advanced · 10 hrs · 2 Concepts

Your Learning Map

📌 You already know

You can build supervised models that learn from labels.

🎯 You'll learn here

Finding structure without labels — k-means clustering and PCA for dimensionality reduction.

🌍 Where it's used

Customer segmentation, anomaly detection and compressing wide datasets.

🔗 Unlocks next

Clustering pairs naturally with text mining, where features are word counts.

Clustering

Concept 1

k-means Clustering

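kmeans() partitions the rows into k clusters by minimising the total within-cluster sum of squares (WSS, returned as tot.withinss). The result depends on the random starting centroids, so fix a seed and try several starts with nstart. k-means is distance-based, so standardise the features with scale() when they are measured on different scales. Use the elbow method to choose k: plot WSS against k and pick the point where the curve flattens.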

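# k-means on the standardised iris measurements (Species is not used)
set.seed(42)                               # starting centroids are random
x <- scale(iris[, 1:4])                    # columns 1:4 are the numeric features
km <- kmeans(x, centers = 3, nstart = 25)  # keep the best of 25 random starts
km$cluster       # cluster assignment for each row
km$centers       # cluster centroids (in standardised units)
km$tot.withinss  # total within-cluster sum of squares
# Elbow plot to choose k:
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "WSS")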
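
To make the quantity being minimised concrete, the sketch below recomputes the total WSS by hand and compares it with the value kmeans() reports. The names assigned_centers and wss_manual are illustrative only, and the check assumes the same standardised iris features as above.

```r
# Sketch: recompute the total within-cluster sum of squares by hand.
x <- scale(iris[, 1:4])                    # standardised numeric features
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# For every row, look up the centroid of the cluster it was assigned to
assigned_centers <- km$centers[km$cluster, ]

# Sum of squared distances from each row to its own centroid
wss_manual <- sum((x - assigned_centers)^2)

all.equal(wss_manual, km$tot.withinss)     # TRUE up to floating-point tolerance
```

Every point contributes its squared distance to its own centroid, so adding clusters can only keep this total the same or reduce it. That is why the elbow plot looks for diminishing returns rather than a minimum.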

# k-means on iris (3 clusters, ignoring Species label)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers  # cluster centroids

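Data Frame Output (species names in parentheses are annotations showing the dominant species; cluster numbers are arbitrary)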

Cluster	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1 (setosa)	5.01	3.43	1.46	0.25
2 (versicolor)	5.90	2.75	4.39	1.43
3 (virginica)	6.85	3.07	5.74	2.07

# WSS elbow plot — choosing optimal k
wss <- sapply(1:8, function(k) {
  kmeans(iris[,1:4], centers=k, nstart=25)$tot.withinss
})
plot(1:8, wss, type="b", pch=19, col="#1d4ed8",
     xlab="Number of Clusters (k)", ylab="Total WSS",
     main="Elbow Method — Optimal k for iris")

Chart Output
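
Reading an elbow by eye can be subjective. As a rough aid, the sketch below prints the percentage drop in WSS for each additional cluster. drop_pct is an illustrative name, not part of kmeans(), and the seed is fixed only to make the run repeatable.

```r
# Sketch: quantify how much each extra cluster reduces the total WSS.
set.seed(42)
wss <- sapply(1:8, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss
})

# Percentage drop in WSS when going from k to k + 1 clusters
drop_pct <- 100 * -diff(wss) / wss[-length(wss)]

data.frame(from_k = 1:7, to_k = 2:8, drop_pct = round(drop_pct, 1))
```

A reasonable k is one after which the drops become small and roughly similar. This is a heuristic to support the plot, not a formal test.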

# Cluster scatter: Petal.Length vs Petal.Width, coloured by k-means cluster
library(ggplot2)
iris_km <- iris                        # work on a copy so iris keeps its original 5 columns
iris_km$cluster <- factor(km$cluster)
ggplot(iris_km, aes(Petal.Length, Petal.Width, color=cluster, shape=Species)) +
  geom_point(size=2.5, alpha=0.8) +
  labs(title="k-means Clusters vs True Species (iris)",
       x="Petal Length", y="Petal Width") +
  theme_minimal()

Chart Output
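
The scatter plot compares clusters and species visually. A cross-tabulation gives the same comparison as counts. This is a minimal sketch: purity is an illustrative name for the share of flowers that belong to the majority species of their cluster, and Species is used only for evaluation, never for fitting.

```r
# Sketch: compare k-means clusters with the known species labels.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows are clusters, columns are species
tab <- table(cluster = km$cluster, species = iris$Species)
tab

# Share of flowers that fall in the majority species of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

The cluster numbers carry no meaning of their own, so read the table row by row instead of expecting cluster 1 to line up with the first species.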

R: k-means cluster scatter

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
km$size
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster, pch = 19,
     xlab = "Petal length", ylab = "Petal width", main = "k-means clusters (k = 3)")

Output (verified)

[1] 62 38 50

Solved Examples

Example 1. Apply k-means clustering to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
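
One possible answer, offered as a sketch and not as the only solution: run k-means on the raw measurements and on the standardised measurements, then compare both results with the species labels. The object names km_raw and km_scaled are illustrative.

```r
# Sketch: two approaches to k-means on iris, raw vs standardised features.
features <- iris[, 1:4]
str(features)     # check the input: four numeric columns
head(features)

# Approach 1: raw measurements (variables with larger spread weigh more)
set.seed(42)
km_raw <- kmeans(features, centers = 3, nstart = 25)

# Approach 2: standardised measurements (every variable weighs equally)
set.seed(42)
km_scaled <- kmeans(scale(features), centers = 3, nstart = 25)

# Compare each clustering with the known species
table(raw = km_raw$cluster, species = iris$Species)
table(scaled = km_scaled$cluster, species = iris$Species)

# Compare the two clusterings with each other
table(raw = km_raw$cluster, scaled = km_scaled$cluster)
```

The two runs need not agree, because scaling changes which distances count as large. Which version is preferable depends on whether the original units are meaningful for your problem.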

Self-Assessment (2 questions)

Q1. k-means clustering requires you to specify:

k-means needs k, the number of clusters, chosen in advance (e.g. via the elbow method).

Q2. k-means is an example of:

Clustering finds structure without labels, so it is unsupervised.

Dimensionality Reduction

Concept 2

PCA — Principal Component Analysis

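prcomp() performs PCA. With scale. = TRUE each variable is standardised first, so variables measured on larger scales do not dominate the components. When the variables are correlated, the first few PCs capture most of the variance, which makes PCA useful for visualisation and for reducing the number of features before modelling.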

library(factoextra)                        # provides the fviz_* plotting helpers
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # columns 1:4 are the numeric measurements
summary(pca)      # proportion of variance explained
biplot(pca)       # loadings and scores
fviz_eig(pca)     # scree plot
fviz_pca_ind(pca, habillage = iris$Species, addEllipses = TRUE)

# PCA on iris — reduce 4 dimensions to 2
pca <- prcomp(iris[,1:4], scale. = TRUE)
summary(pca)

Output

Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.7084 0.9560 0.3831 0.1439
Proportion of Variance 0.7296 0.2285 0.0367 0.0052
Cumulative Proportion  0.7296 0.9581 0.9948 1.0000
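
The proportions in this summary come directly from the component standard deviations. The sketch below recomputes them from pca$sdev; var_explained and n_keep are illustrative names, and the 95% threshold is an arbitrary choice made for the example.

```r
# Sketch: reproduce the "Proportion of Variance" rows from pca$sdev.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance of each component as a share of the total variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 4)            # 0.7296 0.2285 0.0367 0.0052
round(cumsum(var_explained), 4)    # 0.7296 0.9581 0.9948 1.0000

# Smallest number of components whose cumulative share reaches 95%
n_keep <- which(cumsum(var_explained) >= 0.95)[1]
n_keep                             # 2
```

A cumulative-variance cut-off like this is a common rule of thumb for deciding how many components to keep; the scree plot is the visual equivalent.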

# PCA score plot: PC1 explains 73%, PC2 explains 23%, so 96% of the variance is visible in 2D
library(ggplot2)
pca_df <- data.frame(pca$x[,1:2], Species=iris$Species)
ggplot(pca_df, aes(PC1, PC2, color=Species)) +
  geom_point(size=2.5, alpha=0.8) +
  labs(title="PCA of the iris dataset (96% variance in 2 components)",
       x="PC1 (73.0%)", y="PC2 (22.9%)") +
  theme_minimal()

Chart Output

Solved Examples

Example 1. Apply PCA (Principal Component Analysis) to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
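
One possible answer, given as a sketch under stated assumptions: compute the components once with prcomp() and once by hand from the eigen-decomposition of the correlation matrix, then confirm that the two agree. The name scores_manual is illustrative. Signs of components are arbitrary, so the comparison uses absolute values.

```r
# Sketch: PCA on iris two ways, prcomp() vs eigen() on the correlation matrix.
x <- iris[, 1:4]
str(x)    # check the input: four numeric columns

# Approach 1: prcomp() on standardised variables
pca <- prcomp(x, scale. = TRUE)

# Approach 2: eigen-decomposition of the correlation matrix
eig <- eigen(cor(x))

# Component standard deviations are the square roots of the eigenvalues
all.equal(pca$sdev, sqrt(eig$values))

# Loadings match the eigenvectors up to the sign of each column
all.equal(abs(unname(pca$rotation)), abs(eig$vectors))

# Scores are the standardised data projected onto the eigenvectors
scores_manual <- scale(x) %*% eig$vectors
all.equal(abs(unname(pca$x)), abs(unname(scores_manual)))
head(pca$x[, 1:2])
```

Both routes give the same components here because PCA on standardised variables is equivalent to an eigen-decomposition of the correlation matrix. prcomp() uses a singular value decomposition internally, which is generally the numerically preferred route.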

Self-Assessment (2 questions)

Q1. PCA is mainly used to:

PCA projects data onto fewer components that capture most of the variance.

Q2. The first principal component is the direction that:

PC1 is the axis of greatest variance; later components capture progressively less.
