Regression Analysis
Linear Regression
Simple and Multiple Regression
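lm() fits a linear regression by ordinary least squares. summary() reports the coefficients, R-squared, p-values, and the residual standard error.
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
confint(model) # 95% confidence intervals for the coefficients
# newdata must contain every predictor in the formula; these values are illustrative
newdata <- data.frame(wt = 3, hp = 120, cyl = 6)
predict(model, newdata, interval = "confidence")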
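predict() supports two kinds of interval, and they answer different questions. The sketch below contrasts them for one hypothetical car; the values wt = 3, hp = 120, cyl = 6 and the names new_car, conf_int and pred_int are assumptions made for illustration.

```r
# Fit the same multiple regression as above
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Hypothetical new observation (illustrative values, not from the source)
new_car <- data.frame(wt = 3, hp = 120, cyl = 6)

# Interval for the mean mpg of all cars with these characteristics
conf_int <- predict(model, newdata = new_car, interval = "confidence")

# Interval for the mpg of one individual car with these characteristics
pred_int <- predict(model, newdata = new_car, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both calls return the same fitted value, but the prediction interval is wider because it also accounts for the residual scatter around the regression line.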
# Simple linear regression: predict mpg from weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
Call: lm(formula = mpg ~ wt, data = mtcars)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.86 <2e-16 ***
wt -5.3445 0.5591 -9.56 1.3e-10 ***
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The R-squared of 0.75 means weight explains about 75% of the variation in fuel efficiency in this sample. The negative coefficient (-5.34) means each additional 1,000 lb of weight is associated with a drop of about 5.3 miles per gallon on average.
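To connect these numbers to the fitted model, the sketch below recomputes a prediction and the R-squared by hand. The 3,000 lb example car and the variable names are assumptions chosen for illustration. With the coefficients shown above, the prediction is roughly 37.2851 - 5.3445 * 3 ≈ 21.25 mpg.

```r
model <- lm(mpg ~ wt, data = mtcars)
b <- coef(model)

# Prediction by hand for a 3,000 lb car (wt is recorded in units of 1000 lbs)
pred_by_hand <- b[["(Intercept)"]] + b[["wt"]] * 3

# The same prediction via predict()
pred_by_predict <- predict(model, newdata = data.frame(wt = 3))

# R-squared by hand: 1 - residual sum of squares / total sum of squares
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_by_hand <- 1 - ss_res / ss_tot

print(c(by_hand = pred_by_hand, by_predict = unname(pred_by_predict)))
print(c(by_hand = r2_by_hand, from_summary = summary(model)$r.squared))
```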
# Visualise the regression line
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "#1d4ed8", size = 3, alpha = 0.8) +
geom_smooth(method = "lm", color = "#ef4444", se = TRUE) +
labs(title = "MPG vs Weight — mtcars Dataset",
x = "Weight (1000 lbs)", y = "Miles Per Gallon") +
theme_minimal()
# Coefficient table
coef_df <- data.frame(
Term = c("Intercept", "Weight"),
Estimate = c(37.29, -5.34),
Std_Error = c(1.88, 0.56),
t_value = c(19.86, -9.56),
p_value = c("< 2e-16 ***", "1.3e-10 ***")
)
print(coef_df)
| Term | Estimate | Std Error | t value | p value |
|---|---|---|---|---|
| Intercept | 37.285 | 1.878 | 19.86 | < 2e-16 *** |
| Weight (wt) | -5.345 | 0.559 | -9.56 | 1.3e-10 *** |
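Typing the estimates in by hand can drift out of sync with the model. As an alternative sketch, the same table can be pulled directly from the fitted object with coef(summary(model)). The column names below are chosen to mirror the hand-built data frame and are otherwise arbitrary.

```r
model <- lm(mpg ~ wt, data = mtcars)

# coef(summary(model)) returns a numeric matrix with one row per term
coef_mat <- coef(summary(model))

# Convert to a data frame and use syntactic column names
coef_df <- as.data.frame(coef_mat)
names(coef_df) <- c("Estimate", "Std_Error", "t_value", "p_value")

# Move the term names from row names into a regular column
coef_df <- cbind(Term = rownames(coef_mat), coef_df)
rownames(coef_df) <- NULL

print(coef_df, digits = 4)
```

Because the p-values stay numeric here, they can be filtered or formatted later, which is not possible once they are stored as strings with significance stars.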
# See the code example above and adapt it to your data. # Always check your output with str() and head().
Model Diagnostics
Always check the four default diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (which overlays Cook's distance contours).
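# Refit the multiple regression; vif() needs at least two predictors
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
par(mfrow = c(2, 2)); plot(model) # four diagnostic plots
library(car); vif(model) # check multicollinearity (VIF > 5 is a common rule of thumb for concern)
residuals(model) # raw residuals
rstandard(model) # standardised residuals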
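The plots are read by eye, so a numeric check can help flag specific rows. The sketch below uses base R only. The cutoffs (Cook's distance above 4/n, absolute standardised residual above 2) are common rules of thumb assumed here for illustration, not fixed standards.

```r
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Cook's distance measures how much each observation influences the fit
cooks <- cooks.distance(model)
threshold <- 4 / nrow(mtcars) # assumed rule-of-thumb cutoff
influential <- cooks[cooks > threshold]

# Standardised residuals far from zero suggest poorly fitted observations
std_res <- rstandard(model)
large_resid <- std_res[abs(std_res) > 2] # assumed cutoff

print(sort(influential, decreasing = TRUE))
print(large_resid)
```

Flagged rows are candidates for a closer look, not automatic deletions.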
Logistic Regression
Binary Classification with glm()
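For binary outcomes, use glm() with family = binomial. Each coefficient is the change in log-odds for a one-unit increase in that predictor; exponentiate the coefficients to get odds ratios.
# df stands for your own data frame with a 0/1 outcome column (pass) and numeric predictors
log_model <- glm(pass ~ score + attendance, data = df, family = binomial)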
summary(log_model)
exp(coef(log_model)) # odds ratios
predict(log_model, type='response') # predicted probabilities (0-1)
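The snippet above relies on a data frame that is not defined here. For a version that runs as-is, the sketch below swaps in the built-in mtcars data and models transmission type (am, 0 = automatic, 1 = manual) from weight. The choice of am ~ wt and the 0.5 classification cutoff are assumptions made for illustration.

```r
# Logistic regression on a built-in dataset: is the car manual (am = 1)?
log_model <- glm(am ~ wt, data = mtcars, family = binomial)
summary(log_model)

# Coefficients are log-odds; exponentiate to get odds ratios
odds_ratios <- exp(coef(log_model))
print(odds_ratios)

# Predicted probability of a manual transmission for each car
probs <- predict(log_model, type = "response")

# Turn probabilities into classes with an assumed 0.5 cutoff
predicted_class <- as.integer(probs > 0.5)

# In-sample accuracy and a confusion table
accuracy <- mean(predicted_class == mtcars$am)
print(table(Predicted = predicted_class, Actual = mtcars$am))
print(accuracy)
```

An odds ratio below 1 for wt means the odds of a manual transmission fall as weight rises. Accuracy measured on the training data tends to be optimistic, so a held-out set gives a fairer estimate.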