Machine Learning with caret

Advanced 15 hrs 2 Concepts
M1

Classification and Cross-Validation

Concept 1

Data Preparation and Cross-Validation

Before training, always split the data into training and test sets, define the cross-validation scheme, and decide which preprocessing steps to apply.

R
library(caret)
data(iris)
set.seed(42)

# Stratified split (preserves class proportions)
idx   <- createDataPartition(iris$Species, p=0.8, list=FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Cross-validation strategy
ctrl <- trainControl(
  method          = 'cv',    # k-fold
  number          = 10,      # 10-fold
  classProbs      = TRUE,    # needed for ROC
  summaryFunction = multiClassSummary,
  savePredictions = 'final'
)

# Preprocessing is requested through the preProcess argument of train().
# caret estimates it on the training data and reapplies it at prediction time, e.g.
# preProcess = c('center','scale','nzv','knnImpute')
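The preprocessing comment above can be made concrete with caret's standalone preProcess() function. This is a minimal sketch, assuming only centering and scaling on the four numeric iris columns; the names train_df, test_df, and pp are illustrative and not from the lesson. The point it shows is that the transformation parameters are estimated from the training rows only and then reused on the test rows.

```r
library(caret)

data(iris)
set.seed(42)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]    # illustrative names, not from the lesson
test_df  <- iris[-idx, ]

# Estimate centering/scaling parameters from the training predictors only
pp <- preProcess(train_df[, 1:4], method = c("center", "scale"))

# Apply the same training-derived parameters to both sets
train_x <- predict(pp, train_df[, 1:4])
test_x  <- predict(pp, test_df[, 1:4])

round(colMeans(train_x), 10)       # training columns are centered at 0
round(apply(train_x, 2, sd), 10)   # and have unit standard deviation
colMeans(test_x)                   # test columns are only approximately centered
```

When preProcess is passed to train() instead, caret handles this bookkeeping itself, so the explicit version is mainly useful for inspecting what was estimated.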
R
# Load, split, set up 10-fold cross-validation
library(caret)
data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p=0.8, list=FALSE)
trainSet  <- iris[trainIdx, ]
testSet   <- iris[-trainIdx, ]

ctrl <- trainControl(method="cv", number=10, savePredictions=TRUE)
cat("Training samples:", nrow(trainSet), "\n")
cat("Test samples:    ", nrow(testSet), "\n")
Output
Training samples: 120 
Test samples:      30 

Note: with 120 training rows, 10-fold cross-validation holds out 12 samples per fold.
Each fold trains on 108 samples and validates on the remaining 12.
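The fold sizes can be checked directly with createFolds(), which returns the held-out row indices for each fold, stratified by the outcome. This is a sketch for inspection only: train() builds its own folds internally, so the exact rows here will not necessarily match the ones used during training.

```r
library(caret)

data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainSet <- iris[trainIdx, ]

# Held-out (validation) row indices for each of the 10 folds
folds <- createFolds(trainSet$Species, k = 10)

fold_sizes <- sapply(folds, length)
fold_sizes                     # validation rows per fold
nrow(trainSet) - fold_sizes    # training rows per fold
```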
Solved Examples
Example 1 Explain why createDataPartition() is better than sample() for classification.

createDataPartition() samples within each class, so the class proportions of the full dataset are preserved in both the train and test sets. For example, if 10% of the data is class 'Rare', both train and test will contain about 10% 'Rare'. A plain random sample() ignores the labels and can produce train and test sets with noticeably different class distributions, especially on small or imbalanced data, which makes accuracy metrics misleading.
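A small experiment illustrates the difference. The outcome vector below is hypothetical (900 'Common' and 100 'Rare' labels, invented for this sketch), and the two splits are compared by the share of 'Rare' that ends up in the test set.

```r
library(caret)

set.seed(42)
# Hypothetical imbalanced outcome: 10% 'Rare', 90% 'Common'
y <- factor(rep(c("Common", "Rare"), times = c(900, 100)))

# Stratified 80/20 split: sampling happens within each class
strat_idx <- createDataPartition(y, p = 0.8, list = FALSE)[, 1]

# Simple random 80/20 split: class labels are ignored
rand_idx <- sample(seq_along(y), size = 800)

# Share of 'Rare' in each resulting test set
strat_test_rare <- mean(y[-strat_idx] == "Rare")
rand_test_rare  <- mean(y[-rand_idx] == "Rare")
c(stratified = strat_test_rare, random = rand_test_rare)
```

The stratified share stays at the population rate, while the random share varies from seed to seed. With 1,000 rows the gap is usually small; it tends to grow as the dataset or the rare class gets smaller.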

Self-Assessment (2 questions)
Q1. What does createDataPartition() guarantee?
Q2. Which trainControl method performs k-fold cross-validation?
Concept 2

Training and Comparing Models

R
# Train Random Forest with auto-tuning
rf_model <- train(
  Species ~ .,
  data       = train,
  method     = 'rf',
  trControl  = ctrl,
  tuneLength = 5,    # request up to 5 mtry values (iris has 4 predictors, so fewer are tried)
  preProcess = c('center','scale'),
  metric     = 'Accuracy'
)

print(rf_model)            # best tune, CV accuracy
plot(rf_model)             # accuracy vs mtry
rf_model$bestTune          # best hyperparameter
varImp(rf_model) |> plot() # variable importance

# Predict on test set
preds <- predict(rf_model, test)
confusionMatrix(preds, test$Species)

# Compare multiple models
knn_model <- train(Species ~ ., data=train, method='knn', trControl=ctrl)
svm_model <- train(Species ~ ., data=train, method='svmRadial', trControl=ctrl)

results <- resamples(list(RF=rf_model, KNN=knn_model, SVM=svm_model))
summary(results)
dotplot(results)   # visual comparison
bwplot(results)    # boxplot comparison
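tuneLength lets caret choose the candidate hyperparameter values; tuneGrid spells them out explicitly. The sketch below contrasts the two. It swaps in k-nearest neighbours rather than the random forest above, because knn needs no extra package and its single parameter k is easy to read; the names ctrl_cv, knn_auto, and knn_grid are illustrative.

```r
library(caret)

data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainSet <- iris[trainIdx, ]
ctrl_cv  <- trainControl(method = "cv", number = 10)

# tuneLength: caret picks 5 candidate values of k itself
knn_auto <- train(Species ~ ., data = trainSet, method = "knn",
                  trControl = ctrl_cv, tuneLength = 5,
                  preProcess = c("center", "scale"))

# tuneGrid: the candidate values are listed explicitly
knn_grid <- train(Species ~ ., data = trainSet, method = "knn",
                  trControl = ctrl_cv,
                  tuneGrid = expand.grid(k = c(3, 5, 7)),
                  preProcess = c("center", "scale"))

knn_auto$results[, c("k", "Accuracy")]   # one row per candidate chosen by caret
knn_grid$results[, c("k", "Accuracy")]   # one row per value in the grid
knn_grid$bestTune                        # value of k with the best CV accuracy
```

An explicit grid is worth the extra line when the sensible range is known in advance, as with mtry on a dataset that has only a handful of predictors.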
R
# Train 4 models and compare accuracy
models <- list(
  KNN   = train(Species~., data=trainSet, method="knn",  trControl=ctrl),
  SVM   = train(Species~., data=trainSet, method="svmRadial", trControl=ctrl),
  RF    = train(Species~., data=trainSet, method="rf",   trControl=ctrl),
  LDA   = train(Species~., data=trainSet, method="lda",  trControl=ctrl)
)

# Resampling comparison
results <- resamples(models)
summary(results)
Data Frame Output
Model          Min Acc  Mean Acc  Max Acc  Mean Kappa
KNN            0.833    0.944     1.000    0.917
SVM            0.917    0.966     1.000    0.950
Random Forest  0.833    0.958     1.000    0.937
LDA            0.917    0.974     1.000    0.961
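A resampling comparison is most meaningful when every model is evaluated on the same folds, so that differences reflect the models rather than the split. One way to arrange this, sketched here under the assumption of two models only (knn and lda), is to create the fold indices once and pass them to trainControl() through its index argument. The names shared_folds, ctrl_shared, and cmp are illustrative.

```r
library(caret)

data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainSet <- iris[trainIdx, ]

# Fix the training-row indices of each fold once and reuse them for every model
shared_folds <- createFolds(trainSet$Species, k = 10, returnTrain = TRUE)
ctrl_shared  <- trainControl(method = "cv", number = 10, index = shared_folds)

knn_fit <- train(Species ~ ., data = trainSet, method = "knn", trControl = ctrl_shared)
lda_fit <- train(Species ~ ., data = trainSet, method = "lda", trControl = ctrl_shared)

cmp <- resamples(list(KNN = knn_fit, LDA = lda_fit))
summary(cmp)

# Paired, fold-by-fold differences between the two models
summary(diff(cmp))
```

Calling set.seed() with the same value immediately before each train() call is a common alternative that achieves the same pairing.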
R
# Dot plot of cross-validated accuracy (mean with confidence interval per model)
dotplot(results, metric="Accuracy",
        main="10-Fold CV Accuracy — Model Comparison")
Chart Output
R
# Confusion matrix for best model (LDA) on test set
pred_lda <- predict(models$LDA, testSet)
confusionMatrix(pred_lda, testSet$Species)
Output
Confusion Matrix and Statistics

          Reference
Prediction setosa versicolor virginica
 setosa        10          0         0
 versicolor     0          9         1
 virginica      0          1         9

Overall Statistics
 Accuracy : 0.9333
 95% CI : (0.7793, 0.9918)
 Kappa : 0.9000

Sensitivity: setosa=1.00  versicolor=0.90  virginica=0.90
Specificity: setosa=1.00  versicolor=0.95  virginica=0.95
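The summary statistics in this output can be recomputed by hand from the nine counts, which helps when reading a confusion matrix from another tool. The sketch below uses base R only; the counts are copied from the printed matrix, and names such as cm and kappa_hat are illustrative.

```r
# Counts copied from the printed matrix: rows = predictions, columns = reference
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10, 0, 0,
                0, 9, 1,
                0, 1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the row/column totals give by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# One-vs-rest sensitivity and specificity for each class
tp <- diag(cm)
fn <- colSums(cm) - tp
fp <- rowSums(cm) - tp
tn <- n - tp - fn - fp
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(c(Accuracy = accuracy, Kappa = kappa_hat), 4)
round(rbind(sensitivity, specificity), 2)
```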
Solved Examples
Example 1 A confusion matrix for a 3-class problem shows: Setosa: 10/10 correct, Versicolor: 8/10 correct (2 predicted as Virginica), Virginica: 9/10 correct. What is the overall accuracy?

Overall accuracy = correct / total = (10+8+9) / 30 = 27/30 = 90%. The errors fall between Versicolor and Virginica, the two classes that are hardest to separate. Inspecting predicted class probabilities or engineering more discriminative features may improve the result.
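To look at borderline cases rather than hard labels, predict() on a caret model accepts type = "prob". The sketch below is one possible way to flag uncertain test rows; it assumes an LDA model, and the 0.80 cutoff is a hypothetical value chosen only for illustration.

```r
library(caret)

data(iris)
set.seed(42)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainSet <- iris[idx, ]
testSet  <- iris[-idx, ]

lda_fit <- train(Species ~ ., data = trainSet, method = "lda",
                 trControl = trainControl(method = "cv", number = 10,
                                          classProbs = TRUE))

# Class probabilities instead of hard labels: one column per class
probs <- predict(lda_fit, testSet, type = "prob")
head(round(probs, 3))

# Flag test rows whose highest class probability falls below the cutoff
cutoff    <- 0.80   # assumed value, for illustration only
uncertain <- which(apply(probs, 1, max) < cutoff)
testSet[uncertain, "Species"]   # true classes of the uncertain rows (may be empty)
```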

Self-Assessment (2 questions)
Q1. What does tuneLength=5 do in train()?
Q2. Which function compares multiple trained models' cross-validation results?