Regression Analysis

Intermediate, 12 hrs, 3 Concepts
M1

Linear Regression

Concept 1

Simple and Multiple Regression

lm() fits a linear regression model. summary() reports the coefficients, R-squared, p-values, and residual standard error.

R
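model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
confint(model)         # 95% confidence intervals for coefficients
# newdata must contain every predictor used in the model (illustrative values)
newdata <- data.frame(wt = 3, hp = 110, cyl = 6)
predict(model, newdata, interval = 'confidence')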
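
To make the interval option of predict() more concrete, here is a minimal sketch on the same mtcars model. The car described by new_car is a hypothetical example (its values are assumptions, not a row of the dataset). It contrasts interval = "confidence" with interval = "prediction", the other standard option for lm models.

```r
# Multiple regression on the built-in mtcars data
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Hypothetical car (illustrative values, not a row of mtcars)
new_car <- data.frame(wt = 3.0, hp = 120, cyl = 6)

# Interval for the mean mpg of all cars with these predictor values
conf_int <- predict(model, new_car, interval = "confidence")

# Interval for the mpg of one individual new car
pred_int <- predict(model, new_car, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both calls return the same fitted value. The prediction interval is wider because it also accounts for the residual scatter of individual cars around the regression surface.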
R
# Simple linear regression: predict mpg from weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
Output
Call: lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776   19.86   <2e-16 ***
wt           -5.3445     0.5591   -9.56  1.3e-10 ***

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,  Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The R-squared of 0.75 means weight explains about 75% of the variation in fuel efficiency in this sample. The negative coefficient (-5.34) means each additional 1,000 lb of weight is associated with a drop of about 5.3 miles per gallon, on average.
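
The interpretation above can be checked directly in R. This is a small sketch, using weights of 3 and 4 (that is, 3,000 and 4,000 lb) as arbitrary illustrative values: the difference between the two predictions equals the slope, and in simple regression R-squared equals the squared correlation between the two variables.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Predicted mpg for two cars that differ by exactly one unit of wt (1,000 lb)
preds <- predict(model, data.frame(wt = c(3, 4)))

# The difference between the predictions equals the slope coefficient
slope_from_preds <- unname(diff(preds))
slope_from_model <- unname(coef(model)["wt"])
print(c(slope_from_preds, slope_from_model))

# In simple regression, R-squared is the squared correlation of x and y
r2_from_cor   <- cor(mtcars$wt, mtcars$mpg)^2
r2_from_model <- summary(model)$r.squared
print(c(r2_from_cor, r2_from_model))
```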

R
# Visualise the regression line
library(ggplot2)   # provides ggplot(), geom_point(), geom_smooth()
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#1d4ed8", size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = "#ef4444", se = TRUE) +
  labs(title = "MPG vs Weight — mtcars Dataset",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon") +
  theme_minimal()
Chart Output: scatter plot of Miles Per Gallon against Weight (1000 lbs) with the fitted regression line.
R
# Coefficient table
coef_df <- data.frame(
  Term        = c("Intercept", "Weight"),
  Estimate    = c(37.29, -5.34),
  Std_Error   = c(1.88, 0.56),
  t_value     = c(19.86, -9.56),
  p_value     = c("< 2e-16 ***", "1.3e-10 ***")
)
print(coef_df)
Data Frame Output
Term         Estimate  Std Error  t value  p value
Intercept      37.285      1.878    19.86  < 2e-16 ***
Weight (wt)    -5.345      0.559    -9.56  1.3e-10 ***
Solved Examples
Example 1 Apply the concept of Simple and Multiple Regression to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
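
One possible way to work this example, offered as a sketch rather than the course's official solution, is to fit the same simple regression in two ways: with lm(), and with the closed-form least-squares formulas (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)). The choice of mtcars and of wt and mpg as variables is an assumption carried over from the earlier code.

```r
# Inspect the data first, as the note above advises
str(mtcars[, c("mpg", "wt")])
head(mtcars[, c("mpg", "wt")])

x <- mtcars$wt
y <- mtcars$mpg

# Approach 1: fit with lm()
fit <- lm(y ~ x)
print(coef(fit))

# Approach 2: closed-form least-squares estimates
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
print(c(intercept = intercept, slope = slope))
```

Both approaches give the same estimates. lm() is the practical choice because it also returns standard errors, p-values, and residuals, and it extends to multiple predictors.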

Self-Assessment (2 questions)
Q1. What is the primary purpose of simple and multiple regression?
Q2. Which R package is most relevant for this topic?
Concept 2

Model Diagnostics

Always check the four diagnostic plots produced by plot() on an lm object: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (which includes Cook's distance contours).

R
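# Refit the multiple regression: vif() needs a model with at least two predictors
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
par(mfrow = c(2, 2)); plot(model)   # four diagnostic plots in a 2x2 grid
par(mfrow = c(1, 1))                # restore the single-plot layout
library(car); vif(model)            # check multicollinearity (VIF > 5 is a concern)
residuals(model)                    # raw residuals
rstandard(model)                    # standardised residuals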
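
The plots can be complemented with numeric checks. The sketch below uses only base R functions. The 4/n cutoff for Cook's distance and the |standardised residual| > 2 cutoff are common rules of thumb, assumed here for illustration and not taken from this course.

```r
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb
cooks <- cooks.distance(model)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]
print(influential)

# Outliers: standardised residuals beyond +/- 2 deserve a closer look
std_res <- rstandard(model)
print(std_res[abs(std_res) > 2])

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
print(shapiro.test(residuals(model)))
```

Flagged observations are candidates for inspection, not automatic deletion. Check whether they are data errors or genuine but unusual cars before changing the model.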
Solved Examples
Example 1 Apply the concept of Model Diagnostics to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().

Self-Assessment (2 questions)
Q1. What is the primary purpose of model diagnostics?
Q2. Which R package is most relevant for this topic?
M2

Logistic Regression

Concept 1

Binary Classification with glm()

For binary outcomes, use glm() with family=binomial. Each coefficient is the change in log-odds per one-unit increase in its predictor; exponentiate it to get an odds ratio.

R
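# df is assumed to be a data frame with a 0/1 outcome column `pass`
# and numeric predictor columns `score` and `attendance`
log_model <- glm(pass ~ score + attendance, data = df, family = binomial)
summary(log_model)
exp(coef(log_model))                   # odds ratios
predict(log_model, type = 'response')  # predicted probabilities (0-1)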
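
Because df is not defined in this section, here is a runnable sketch of the same workflow on the built-in mtcars data, where am (0 = automatic, 1 = manual) is the binary outcome. The choice of wt as the only predictor and the 0.5 classification cutoff are assumptions made for illustration.

```r
# Binary outcome: am (0 = automatic, 1 = manual) in the built-in mtcars data
log_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(log_model))
print(odds_ratios)

# Predicted probability of a manual transmission for each car
probs <- predict(log_model, type = "response")

# Classify with a 0.5 cutoff (an assumed threshold) and compare with the truth
predicted <- ifelse(probs > 0.5, 1, 0)
print(table(Predicted = predicted, Actual = mtcars$am))
```

An odds ratio below 1 for wt means that heavier cars have lower odds of having a manual transmission in this dataset.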
Solved Examples
Example 1 Apply the concept of Binary Classification with glm() to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().

Self-Assessment (2 questions)
Q1. What is the primary purpose of binary classification with glm()?
Q2. Which R package is most relevant for this topic?