Unsupervised Learning & PCA

Advanced · 10 hrs · 2 Concepts
M1

Clustering

Concept 1

k-means Clustering

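kmeans() partitions observations into k clusters by minimising the total within-cluster sum of squares (WCSS), that is, the sum of squared Euclidean distances from each point to its cluster centroid. The algorithm starts from random centroids, so set a seed for reproducibility and use nstart to try several random starts. Standardise features with scale() when they are measured on different scales. To choose k, use the elbow method: plot WCSS against k and pick the point beyond which adding clusters gives only small reductions.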

R
set.seed(42)
km <- kmeans(scale(iris[,-5]), centers=3, nstart=25)
km$cluster; km$centers; km$tot.withinss
# Elbow plot to choose k:
wss <- sapply(1:10, function(k) kmeans(scale(iris[,-5]),k,nstart=25)$tot.withinss)
plot(1:10, wss, type='b', xlab='k', ylab='WSS')
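To make the WCSS objective concrete, the following sketch recomputes it by hand and compares it with the value reported by kmeans(). The variable names (x, assigned_centers, wcss_manual) are illustrative and not part of the original lesson.

```r
# Recompute the total within-cluster sum of squares manually
set.seed(42)
x  <- scale(iris[, 1:4])                      # standardised numeric features
km <- kmeans(x, centers = 3, nstart = 25)

# Look up the centroid assigned to each observation (one row per observation)
assigned_centers <- km$centers[km$cluster, , drop = FALSE]

# Sum of squared Euclidean distances to the assigned centroids
wcss_manual <- sum((x - assigned_centers)^2)

all.equal(wcss_manual, km$tot.withinss)       # the two values should agree

# The total sum of squares splits into within- and between-cluster parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
```

A lower tot.withinss means tighter clusters, but it always falls as k grows, which is why it is read from an elbow plot rather than minimised outright.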
R
# k-means on iris (3 clusters, ignoring Species label)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers  # cluster centroids
Data Frame Output
Cluster         Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
1 (setosa)              5.01         3.43          1.46         0.25
2 (versicolor)          5.90         2.75          4.39         1.43
3 (virginica)           6.85         3.07          5.74         2.07
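The species names attached to the clusters above are interpretive labels, since kmeans() never sees the Species column and its cluster numbering is arbitrary. One way to check such labels, sketched here with the illustrative names confusion and dominant, is to cross-tabulate cluster membership against the true species.

```r
# Cross-tabulate k-means clusters against the known species
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

confusion <- table(Cluster = km$cluster, Species = iris$Species)
confusion

# Most frequent species within each cluster (one entry per cluster)
dominant <- colnames(confusion)[apply(confusion, 1, which.max)]
dominant
```

Rows with counts spread across more than one species show where the clusters and the true classes disagree.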
R
# WSS elbow plot — choosing optimal k
wss <- sapply(1:8, function(k) {
  kmeans(iris[,1:4], centers=k, nstart=25)$tot.withinss
})
plot(1:8, wss, type="b", pch=19, col="#1d4ed8",
     xlab="Number of Clusters (k)", ylab="Total WSS",
     main="Elbow Method — Optimal k for iris")
Chart Output
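Reading an elbow by eye can be subjective. As a rough numeric complement, the sketch below tabulates the percentage drop in total WSS for each additional cluster; the column name drop_pct is an illustrative choice, and no fixed cut-off is implied.

```r
# Percentage reduction in total WSS when moving from k-1 to k clusters
set.seed(42)
wss <- sapply(1:8, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss
})

drop_pct <- -diff(wss) / wss[-length(wss)] * 100

elbow_tbl <- data.frame(
  k        = 2:8,
  wss      = round(wss[-1], 1),
  drop_pct = round(drop_pct, 1)
)
elbow_tbl
```

The elbow is roughly where drop_pct stops being large; treat this as a guide alongside the plot, not a rule.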
R
# Cluster scatter: Petal.Length vs Petal.Width, coloured by k-means cluster
library(ggplot2)
iris_km <- iris                        # work on a copy so iris keeps its original 5 columns
iris_km$cluster <- factor(km$cluster)
ggplot(iris_km, aes(Petal.Length, Petal.Width, color=cluster, shape=Species)) +
  geom_point(size=2.5, alpha=0.8) +
  labs(title="k-means Clusters vs True Species (iris)",
       x="Petal Length", y="Petal Width") +
  theme_minimal()
Chart Output
Solved Examples
Example 1. Apply k-means clustering to a sample dataset, using at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
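One possible way to work this exercise, offered as a sketch rather than a model answer, is to cluster the built-in USArrests dataset twice: once on the raw values and once on standardised values. The choice of dataset and of k = 4 are assumptions made purely for illustration.

```r
# Inspect the data first
str(USArrests)
head(USArrests)

set.seed(42)

# Approach 1: k-means on raw values (Assault dominates because of its larger scale)
km_raw <- kmeans(USArrests, centers = 4, nstart = 25)

# Approach 2: k-means on standardised values (each variable contributes equally)
km_std <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# Compare how the two approaches group the 50 states
agreement <- table(raw = km_raw$cluster, scaled = km_std$cluster)
agreement
```

If the two approaches disagree, that usually reflects the effect of variable scale on Euclidean distance, not an error in either run.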

Self-Assessment (2 questions)
Q1. What is the primary purpose of k-means clustering?
Q2. Which R package is most relevant for this topic?
M2

Dimensionality Reduction

Concept 1

PCA — Principal Component Analysis

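prcomp() performs PCA by rotating the data onto orthogonal principal components (PCs), ordered by the amount of variance each explains. Set scale. = TRUE when the variables are measured on different scales. The first few PCs often capture most of the variance, which makes PCA useful for visualisation and for reducing the number of features before modelling.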

R
pca <- prcomp(iris[, 1:4], scale.=TRUE)  # select the four numeric columns explicitly
summary(pca)      # proportion of variance explained
biplot(pca)       # loadings and scores
library(factoextra)
fviz_eig(pca)     # scree plot
fviz_pca_ind(pca, habillage=iris$Species, addEllipses=TRUE)
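To show what prcomp() computes, the following sketch reproduces the component standard deviations from an eigendecomposition of the correlation matrix. This is an equivalent route used here for illustration; prcomp() itself works through a singular value decomposition. The names eig and prop_var are illustrative.

```r
# PCA on standardised data is an eigendecomposition of the correlation matrix
pca <- prcomp(iris[, 1:4], scale. = TRUE)
eig <- eigen(cor(iris[, 1:4]))

# Eigenvalues are the variances of the principal components
all.equal(sqrt(eig$values), pca$sdev)

# Proportion of variance explained by each component
prop_var <- eig$values / sum(eig$values)
round(prop_var, 4)
```

The eigenvectors in eig$vectors match the columns of pca$rotation up to sign, because the direction of each component is arbitrary.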
R
# PCA on iris — reduce 4 dimensions to 2
pca <- prcomp(iris[,1:4], scale. = TRUE)
summary(pca)
Output
Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.7084 0.9560 0.3831 0.1439
Proportion of Variance 0.7296 0.2285 0.0367 0.0052
Cumulative Proportion  0.7296 0.9581 0.9948 1.0000
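The cumulative proportions above can drive feature reduction directly. The sketch below keeps the smallest number of components whose cumulative variance reaches a threshold; the 95% threshold and the names cum_var, n_keep and reduced are assumptions chosen for illustration.

```r
# Keep the fewest components that explain at least 95% of the variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

n_keep  <- which(cum_var >= 0.95)[1]             # 2 for iris, per the summary above
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

dim(reduced)                                     # observations x retained components
```

The reduced matrix can then stand in for the original four columns in a downstream model.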
R
# PCA score plot: PC1 explains 73% and PC2 23%, so about 96% of the variance is shown in 2D
library(ggplot2)
pca_df <- data.frame(pca$x[,1:2], Species=iris$Species)
ggplot(pca_df, aes(PC1, PC2, color=Species)) +
  geom_point(size=2.5, alpha=0.8) +
  labs(title="PCA — iris Dataset (96% variance in 2 components)",
       x="PC1 (73.0%)", y="PC2 (22.9%)") +
  theme_minimal()
Chart Output
Solved Examples
Example 1. Apply PCA (Principal Component Analysis) to a sample dataset, using at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
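One possible way to work this exercise, offered as a sketch rather than a model answer, is to run PCA on the built-in USArrests dataset with both base R functions: prcomp(), which uses a singular value decomposition, and princomp(), which uses an eigendecomposition. The dataset choice is an assumption made for illustration.

```r
# Inspect the data first
str(USArrests)
head(USArrests)

# Approach 1: prcomp() on standardised variables
pca_svd <- prcomp(USArrests, scale. = TRUE)

# Approach 2: princomp() on the correlation matrix
pca_eig <- princomp(USArrests, cor = TRUE)

# Component standard deviations agree between the two approaches
all.equal(pca_svd$sdev, unname(pca_eig$sdev))

# Loadings agree up to sign, so compare absolute values
max_diff <- max(abs(abs(pca_svd$rotation) - abs(unclass(pca_eig$loadings))))
max_diff
```

The scores from the two functions differ by a small scaling factor, because princomp() uses divisor n and prcomp() uses n - 1 when standardising.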

Self-Assessment (2 questions)
Q1. What is the primary purpose of PCA (Principal Component Analysis)?
Q2. Which R package is most relevant for this topic?