Machine Learning with caret

Advanced 15 hrs 2 Concepts

Your Learning Map

📌 You already know

You can fit and evaluate regression models.

🎯 You'll learn here

A unified ML workflow with caret — data prep, cross-validation, and comparing models.

🌍 Where it's used

The disciplined train/validate/test loop behind every credible predictive model.

🔗 Unlocks next

Underpins the algorithms in Trees & Forests and beyond.

Classification and Cross-Validation

Concept 1

Data Preparation and Cross-Validation

Before training, always: split data, define cross-validation, and preprocess.

library(caret)
data(iris)
set.seed(42)

# Stratified split (preserves class proportions)
idx   <- createDataPartition(iris$Species, p=0.8, list=FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Cross-validation strategy
ctrl <- trainControl(
  method          = 'cv',    # k-fold
  number          = 10,      # 10-fold
  classProbs      = TRUE,    # needed for ROC
  summaryFunction = multiClassSummary,
  savePredictions = 'final'
)

# Preprocessing inside train(): parameters are estimated on the training
# data only, then reapplied automatically when predicting on new data
# preProcess = c('center','scale','nzv','knnImpute')
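
To make the preprocessing comment concrete, here is a minimal base-R sketch of leak-free centering and scaling: the means and standard deviations are estimated from the training rows only and then reused for the test rows. The names train_x, test_x, mu, and sigma are illustrative, the split is a simple random one, and this is not caret's internal code.

```r
data(iris)
set.seed(42)

# Simple random 80/20 split, used here only to illustrate scaling
idx     <- sample(nrow(iris), 120)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Estimate centering and scaling parameters from the training rows only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same training-derived parameters to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)  # zero by construction
round(colMeans(test_scaled), 3)    # generally near zero, not exactly zero
```

Estimating mu and sigma from the full dataset instead would let information from the test rows influence the training features.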

# Load, split, set up 10-fold cross-validation
library(caret)
data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p=0.8, list=FALSE)
trainSet  <- iris[trainIdx, ]
testSet   <- iris[-trainIdx, ]

ctrl <- trainControl(method="cv", number=10, savePredictions=TRUE)
cat("Training samples:", nrow(trainSet), "\n")
cat("Test samples:    ", nrow(testSet), "\n")

Output

Training samples: 120 
Test samples:      30 

Note: with 120 training samples, 10-fold cross-validation puts 12 samples in each fold.
Each iteration trains on the other 108 samples and validates on the held-out 12.
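
The fold arithmetic can be checked with a short base-R sketch. It assigns 120 rows to 10 folds at random and counts the training and validation sizes per fold. The names n, k, and fold_id are illustrative, and the assignment here is plain random rather than stratified by class.

```r
set.seed(42)
n <- 120   # training rows, as in the output above
k <- 10    # number of folds

# Assign each row to one of k folds (plain random assignment, not stratified)
fold_id <- sample(rep(seq_len(k), length.out = n))

# Validation and training sizes for every fold
val_size   <- as.vector(table(fold_id))
train_size <- n - val_size

data.frame(fold = seq_len(k), train = train_size, validate = val_size)
```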

R: A reproducible train/test split

set.seed(42)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
c(train = length(idx), test = nrow(iris) - length(idx))

Output (verified)

train  test 
  105    45
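
What makes this split reproducible is the call to set.seed(). A small sketch of that behaviour follows; the names idx_a, idx_b, and idx_c are illustrative.

```r
data(iris)

set.seed(42)
idx_a <- sample(nrow(iris), 105)

set.seed(42)
idx_b <- sample(nrow(iris), 105)

idx_c <- sample(nrow(iris), 105)  # no reseeding, so a different draw

identical(idx_a, idx_b)  # TRUE: same seed, same indices
identical(idx_a, idx_c)  # FALSE: the generator has moved on
```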

Solved Examples

Example 1 Explain why createDataPartition() is better than sample() for classification.

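createDataPartition() samples within each class, so the class proportions of the full dataset are approximately preserved in both the train and test sets. For example, if 10% of the data belongs to class 'Rare', both train and test will contain about 10% 'Rare'. A plain sample() ignores the labels, so it can produce train and test sets with noticeably different class distributions, especially on small or imbalanced data, which leads to misleading accuracy metrics.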
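
The difference can be seen on a toy example. The sketch below uses base R only and hypothetical labels (90 'Common', 10 'Rare'); it mimics the idea of stratified sampling by drawing 80% within each class, and is not caret's implementation.

```r
set.seed(42)

# Hypothetical imbalanced labels: 90 'Common' and 10 'Rare'
y <- factor(rep(c("Common", "Rare"), times = c(90, 10)))

# Stratified: sample 80% of the row indices within each class
by_class  <- split(seq_along(y), y)
strat_idx <- unlist(lapply(by_class, function(i) sample(i, round(0.8 * length(i)))))

# Unstratified: sample 80% of all rows, ignoring the labels
rand_idx <- sample(length(y), 80)

prop.table(table(y[strat_idx]))  # Rare share is exactly 8/80 = 0.10
prop.table(table(y[rand_idx]))   # Rare share depends on the draw
```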

Self-Assessment (2 questions)

Q1. What does createDataPartition() guarantee?

createDataPartition() performs stratified sampling: it samples within each class, so class proportions are approximately preserved in every partition. This matters most for imbalanced datasets.

Q2. Which trainControl method performs k-fold cross-validation?

method='cv' with number=k performs k-fold CV. 'repeatedcv' repeats k-fold multiple times for more stable estimates.
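
As a brief illustration, a repeated cross-validation setup might look like the sketch below. The object name ctrl_rep and the choice of 5 repeats are assumptions made for the example.

```r
library(caret)

# 10-fold cross-validation repeated 5 times: 50 resamples in total
ctrl_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

ctrl_rep$method
ctrl_rep$number * ctrl_rep$repeats
```

More repeats give a steadier estimate at the cost of proportionally more model fits.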

Concept 2

Training and Comparing Models

# Train Random Forest with auto-tuning
rf_model <- train(
  Species ~ .,
  data       = train,
  method     = 'rf',
  trControl  = ctrl,
  tuneLength = 5,    # try up to 5 candidate values of mtry
  preProcess = c('center','scale'),
  metric     = 'Accuracy'
)

print(rf_model)            # best tune, CV accuracy
plot(rf_model)             # accuracy vs mtry
rf_model$bestTune          # best hyperparameter
varImp(rf_model) |> plot() # variable importance

# Predict on test set
preds <- predict(rf_model, test)
confusionMatrix(preds, test$Species)

# Compare multiple models
knn_model <- train(Species ~ ., data=train, method='knn', trControl=ctrl)
svm_model <- train(Species ~ ., data=train, method='svmRadial', trControl=ctrl)

results <- resamples(list(RF=rf_model, KNN=knn_model, SVM=svm_model))
summary(results)
dotplot(results)   # visual comparison
bwplot(results)    # boxplot comparison

# Train 4 models and compare accuracy
models <- list(
  KNN   = train(Species~., data=trainSet, method="knn",  trControl=ctrl),
  SVM   = train(Species~., data=trainSet, method="svmRadial", trControl=ctrl),
  RF    = train(Species~., data=trainSet, method="rf",   trControl=ctrl),
  LDA   = train(Species~., data=trainSet, method="lda",  trControl=ctrl)
)

# Resampling comparison
results <- resamples(models)
summary(results)

Data Frame Output

Model	Min Acc	Mean Acc	Max Acc	Mean Kappa
KNN	0.833	0.944	1.000	0.917
SVM	0.917	0.966	1.000	0.950
Random Forest	0.833	0.958	1.000	0.937
LDA	0.917	0.974	1.000	0.961

# Dot plot of cross-validated accuracy
dotplot(results, metric="Accuracy",
        main="10-Fold CV Accuracy — Model Comparison")

Chart Output

# Confusion matrix for best model (LDA) on test set
pred_lda <- predict(models$LDA, testSet)
confusionMatrix(pred_lda, testSet$Species)

Output

Confusion Matrix and Statistics

          Reference
Prediction setosa versicolor virginica
 setosa        10          0         0
 versicolor     0          9         1
 virginica      0          1         9

Overall Statistics
 Accuracy : 0.9333
 95% CI : (0.7793, 0.9918)
 Kappa : 0.9000

Sensitivity: setosa=1.00  versicolor=0.90  virginica=0.90
Specificity: setosa=1.00  versicolor=0.95  virginica=0.95
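
The summary statistics above can be recomputed by hand from the confusion matrix, which helps when checking what each number means. The following base-R sketch does this; the variable names are illustrative and the code is independent of caret.

```r
# Confusion matrix from the output above: rows are predictions, columns are references
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10, 0, 0,
                0, 9, 1,
                0, 1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
tp <- diag(cm)
fn <- colSums(cm) - tp   # members of the class that were missed
fp <- rowSums(cm) - tp   # other classes predicted as this class
tn <- n - tp - fn - fp

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's kappa: agreement beyond what the marginals would give by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
round(rbind(sensitivity, specificity), 2)
```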

Solved Examples

Example 1 A confusion matrix for a 3-class problem shows: Setosa: 10/10 correct, Versicolor: 8/10 correct (2 predicted as Virginica), Virginica: 9/10 correct. What is the overall accuracy?

Overall accuracy = correct / total = (10+8+9) / 30 = 27/30 = 90%. All of the errors are between Versicolor and Virginica, which are harder to separate. Consider adjusting probability thresholds or engineering new features to improve the separation.
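
The same arithmetic in base R, as a small sketch. One assumption is needed because the example does not say where the single misclassified Virginica went: here it is placed in Versicolor.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the true classes.
# Assumption: the single misclassified virginica was predicted as versicolor.
cm <- matrix(c(10, 0, 0,
                0, 8, 1,
                0, 2, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

correct_per_class <- diag(cm)
overall_accuracy  <- sum(correct_per_class) / sum(cm)

correct_per_class   # 10, 8, 9
overall_accuracy    # 0.9
```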

Self-Assessment (2 questions)

Q1. What does tuneLength=5 do in train()?

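tuneLength sets how many candidate values caret generates for each tuning parameter (e.g., mtry for RF, k for kNN). Fewer values are tried when the parameter's valid range is smaller than that. Use tuneGrid to specify the candidate values manually.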
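
A manual grid might look like the following sketch. It uses kNN on the full iris data with 5-fold cross-validation to keep the example small; the names ctrl_cv, knn_grid, and knn_fit and the candidate values of k are choices made for illustration.

```r
library(caret)
data(iris)
set.seed(42)

ctrl_cv <- trainControl(method = "cv", number = 5)

# Explicit grid: caret evaluates exactly these values of k
knn_grid <- expand.grid(k = c(3, 5, 7, 9))

knn_fit <- train(Species ~ ., data = iris,
                 method    = "knn",
                 trControl = ctrl_cv,
                 tuneGrid  = knn_grid)

knn_fit$results[, c("k", "Accuracy")]  # one row per candidate k
knn_fit$bestTune                        # the k with the best CV accuracy
```

The column names in the grid must match the model's tuning parameter names, which for kNN is the single parameter k.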

Q2. Which function compares multiple trained models' cross-validation results?

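resamples() collects the cross-validation results from multiple trained models so they can be compared with summary(), dotplot(), and bwplot(). For the comparison to be fair, the models should be evaluated on the same resampling folds, for example by setting the same seed before each train() call or by passing fixed fold indices to trainControl().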
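
One way to make sure every model sees identical folds is to create the fold indices once and hand them to trainControl() through its index argument. The sketch below does this for two models on the full iris data; the names shared_folds, ctrl_shared, knn_fit, lda_fit, and cmp are illustrative.

```r
library(caret)
data(iris)
set.seed(42)

# Fix the training indices of every fold once, then reuse them for all models
shared_folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)
ctrl_shared  <- trainControl(method = "cv", index = shared_folds)

knn_fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl_shared)
lda_fit <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl_shared)

# Both models were scored on identical folds, so results can be paired
cmp <- resamples(list(KNN = knn_fit, LDA = lda_fit))
summary(cmp)
summary(diff(cmp))  # paired differences between the two models
```

Because the folds are shared, a difference in accuracy on a given fold reflects the models rather than a luckier or unluckier split.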
