Machine Learning with caret
Classification and Cross-Validation
Data Preparation and Cross-Validation
Before training, always: split data, define cross-validation, and preprocess.
library(caret)
data(iris)
set.seed(42)
# Stratified split (preserves class proportions)
idx <- createDataPartition(iris$Species, p=0.8, list=FALSE)
train <- iris[idx, ]
test <- iris[-idx, ]
# Cross-validation strategy
ctrl <- trainControl(
  method = 'cv',                       # k-fold
  number = 10,                         # 10-fold
  classProbs = TRUE,                   # needed for ROC
  summaryFunction = multiClassSummary,
  savePredictions = 'final'
)
# Preprocessing within caret: pass preProcess to train(). The parameters are
# estimated on the training data and reapplied by predict() to new data.
# preProcess = c('center','scale','nzv','knnImpute')
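To make the idea behind the preprocessing comment concrete, here is a minimal base-R sketch of centering and scaling with parameters learned from the training rows only. It illustrates the principle and is not caret's internal implementation; the names `train_rows`, `x_train`, `x_test`, `mu`, and `sigma` are hypothetical, and the split is a plain random one used only for this illustration.

```r
data(iris)
set.seed(42)

# Simple random split, used only for this illustration
train_rows <- sample(nrow(iris), size = 120)
x_train <- iris[train_rows, 1:4]
x_test  <- iris[-train_rows, 1:4]

# Learn centering and scaling parameters from the training rows only
mu    <- colMeans(x_train)
sigma <- apply(x_train, 2, sd)

# Apply the same training-derived parameters to both sets
x_train_std <- scale(x_train, center = mu, scale = sigma)
x_test_std  <- scale(x_test,  center = mu, scale = sigma)

colMeans(x_train_std)  # approximately zero for every column
colMeans(x_test_std)   # near zero, but generally not exactly zero
```

The test columns are not exactly centered because their own means were never used. That is the point: nothing about the test set leaks into the preprocessing step.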
# Load, split, set up 10-fold cross-validation
library(caret)
data(iris)
set.seed(42)
trainIdx <- createDataPartition(iris$Species, p=0.8, list=FALSE)
trainSet <- iris[trainIdx, ]
testSet <- iris[-trainIdx, ]
ctrl <- trainControl(method="cv", number=10, savePredictions=TRUE)
cat("Training samples:", nrow(trainSet), "\n")
cat("Test samples: ", nrow(testSet), "\n")
Training samples: 120
Test samples: 30
Cross-validation: 10-fold (12 samples per fold)
Each fold trains on 108 samples, validates on 12.
createDataPartition() preserves the class proportions of the full dataset in both the train and test sets. E.g., if 10% of the data is class 'Rare', both train and test will have ~10% 'Rare'. A plain random split with sample() can produce train and test sets with noticeably different class distributions, especially on small or imbalanced data, which leads to misleading accuracy metrics.
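The difference between a stratified and a plain random split can be sketched in base R. This is an illustration of the idea (sampling within each class), not caret's actual code; the label vector `y` with 10% 'Rare' and the names `strat_idx` and `rand_idx` are made up for the example.

```r
set.seed(42)

# Hypothetical labels: 20 'Rare' (10%) and 180 'Common' (90%)
y <- factor(rep(c("Rare", "Common"), times = c(20, 180)))

# Stratified 80% split: sample 80% of the row indices within each class
strat_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

# Plain random 80% split for comparison
rand_idx <- sample(seq_along(y), size = round(0.8 * length(y)))

prop.table(table(y[strat_idx]))   # 'Rare' share of the stratified training part
prop.table(table(y[-strat_idx]))  # 'Rare' share of the stratified held-out part
prop.table(table(y[rand_idx]))    # 'Rare' share of the random split; varies with the seed
```

With the stratified indices, both parts keep exactly 10% 'Rare' here because 80% of each class is drawn separately. The random split carries no such guarantee.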
Training and Comparing Models
# Train Random Forest with auto-tuning
rf_model <- train(
  Species ~ .,
  data = train,
  method = 'rf',
  trControl = ctrl,
  tuneLength = 5,   # request up to 5 mtry values; with 4 predictors caret tries fewer
  preProcess = c('center','scale'),
  metric = 'Accuracy'
)
print(rf_model) # best tune, CV accuracy
plot(rf_model) # accuracy vs mtry
rf_model$bestTune # best hyperparameter
varImp(rf_model) |> plot() # variable importance
# Predict on test set
preds <- predict(rf_model, test)
confusionMatrix(preds, test$Species)
# Compare multiple models
knn_model <- train(Species ~ ., data=train, method='knn', trControl=ctrl)
svm_model <- train(Species ~ ., data=train, method='svmRadial', trControl=ctrl)
results <- resamples(list(RF=rf_model, KNN=knn_model, SVM=svm_model))
summary(results)
dotplot(results) # visual comparison
bwplot(results) # boxplot comparison
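When models are compared with resamples(), the comparison is cleanest if every model is evaluated on the same folds. One way to arrange that, sketched here under assumptions, is to create the folds once with createFolds() and pass them to trainControl() through its index argument. The names `train_df`, `fold_idx`, `shared_ctrl`, `knn_fit`, and `lda_fit` are hypothetical, and only two models are trained to keep the example short (method 'lda' needs the MASS package).

```r
library(caret)
data(iris)
set.seed(42)

idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]

# Fix fold membership once so every model sees identical resamples
fold_idx <- createFolds(train_df$Species, k = 10, returnTrain = TRUE)
shared_ctrl <- trainControl(method = "cv", number = 10, index = fold_idx)

knn_fit <- train(Species ~ ., data = train_df, method = "knn", trControl = shared_ctrl)
lda_fit <- train(Species ~ ., data = train_df, method = "lda", trControl = shared_ctrl)

# Because the folds match, differences can be computed fold by fold
paired <- resamples(list(KNN = knn_fit, LDA = lda_fit))
summary(diff(paired))
```

An alternative that gives the same effect is to call set.seed() with the same value immediately before each train() call.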
# Train 4 models and compare accuracy
models <- list(
  KNN = train(Species~., data=trainSet, method="knn", trControl=ctrl),
  SVM = train(Species~., data=trainSet, method="svmRadial", trControl=ctrl),
  RF  = train(Species~., data=trainSet, method="rf", trControl=ctrl),
  LDA = train(Species~., data=trainSet, method="lda", trControl=ctrl)
)
# Resampling comparison
results <- resamples(models)
summary(results)
| Model | Min Acc | Mean Acc | Max Acc | Mean Kappa |
|---|---|---|---|---|
| KNN | 0.833 | 0.944 | 1.000 | 0.917 |
| SVM | 0.917 | 0.966 | 1.000 | 0.950 |
| Random Forest | 0.833 | 0.958 | 1.000 | 0.937 |
| LDA | 0.917 | 0.974 | 1.000 | 0.961 |
# Accuracy dot plot (mean accuracy with confidence interval per model)
dotplot(results, metric="Accuracy",
main="10-Fold CV Accuracy — Model Comparison")
# Confusion matrix for best model (LDA) on test set
pred_lda <- predict(models$LDA, testSet)
confusionMatrix(pred_lda, testSet$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          9         1
  virginica       0          1         9

Overall Statistics
  Accuracy : 0.9333
  95% CI   : (0.7793, 0.9918)
  Kappa    : 0.9000

Sensitivity: setosa=1.00 versicolor=0.90 virginica=0.90
Specificity: setosa=1.00 versicolor=0.95 virginica=0.95
Overall accuracy = correct / total = (10+9+9) / 30 = 28/30 = 93.3%. The only confusion is between versicolor and virginica, which are harder to separate. Consider adjusting probability thresholds or engineering new features to improve.
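The statistics reported by confusionMatrix() can be recomputed by hand from the table, which is a useful check on the arithmetic. The sketch below re-enters the matrix shown above and derives accuracy, Kappa, and the per-class sensitivity and specificity in base R. The variable names are hypothetical.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (true) classes
cm <- matrix(c(10, 0, 0,
                0, 9, 1,
                0, 1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n          <- sum(cm)
accuracy   <- sum(diag(cm)) / n                      # (10 + 9 + 9) / 30
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_stat <- (accuracy - expected) / (1 - expected)

tp <- diag(cm)            # correctly predicted, per class
fn <- colSums(cm) - tp    # true members of the class predicted as something else
fp <- rowSums(cm) - tp    # other classes predicted as this class
tn <- n - tp - fn - fp

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(c(Accuracy = accuracy, Kappa = kappa_stat), 4)
round(rbind(sensitivity, specificity), 2)
```

Chance agreement is 1/3 here because each class has 10 predicted and 10 actual cases, so Kappa works out to (0.9333 - 0.3333) / (1 - 0.3333) = 0.90, matching the output.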