Regression Analysis

Intermediate 12 hrs 3 Concepts

Your Learning Map

📌 You already know

You understand means, variance and hypothesis tests.

🎯 You'll learn here

Fitting and reading linear and multiple regression, diagnostics, and logistic regression with glm().

🌍 Where it's used

Predicting prices, demand or risk, and explaining what drives an outcome — the workhorse of analytics.

🔗 Unlocks next

Regression is the gateway to machine learning.

Linear Regression

Concept 1

Simple and Multiple Regression

lm() fits a linear regression model. summary() reports the coefficients with their standard errors and p-values, R-squared, and the residual standard error.

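# Multiple regression: predict mpg from weight, horsepower and cylinders
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
confint(model)         # confidence intervals for coefficients
newdata <- data.frame(wt = 3, hp = 110, cyl = 6)   # illustrative car to predict for
predict(model, newdata, interval = 'confidence')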

# Simple linear regression: predict mpg from weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

Output

Call: lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776   19.86   <2e-16 ***
wt           -5.3445     0.5591   -9.56  1.3e-10 ***

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,  Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The R-squared of 0.75 means weight alone explains about 75% of the variation in fuel efficiency across the cars in this dataset. The negative coefficient (-5.34) means each additional 1,000 lb of weight is associated with a drop of about 5.3 miles per gallon, on average.
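The slope interpretation above can be checked directly from the fitted model. This is a minimal sketch, assuming the same mtcars model; the two weights (3 and 4, meaning 3,000 and 4,000 lb) and the names new_cars, pred, slope_check and r2 are illustration choices, not part of the original lesson.

```r
# Refit the simple model so the block is self-contained
model <- lm(mpg ~ wt, data = mtcars)

# Predicted MPG for two hypothetical cars that differ by one unit of wt (1,000 lb)
new_cars <- data.frame(wt = c(3, 4))
pred <- predict(model, newdata = new_cars)

# The gap between the two predictions equals the wt coefficient
slope_check <- unname(pred[2] - pred[1])
slope_check
coef(model)["wt"]

# R-squared can be pulled out of the summary object instead of read off the printout
r2 <- summary(model)$r.squared
r2
```

Extracting values this way avoids retyping numbers from the printed summary.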

# Visualise the regression line
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#1d4ed8", size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = "#ef4444", se = TRUE) +
  labs(title = "MPG vs Weight — mtcars Dataset",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon") +
  theme_minimal()

Chart Output

# Coefficient table (values copied from summary(model), rounded to three decimals)
coef_df <- data.frame(
  Term        = c("Intercept", "Weight (wt)"),
  Estimate    = c(37.285, -5.345),
  Std_Error   = c(1.878, 0.559),
  t_value     = c(19.86, -9.56),
  p_value     = c("< 2e-16 ***", "1.3e-10 ***")
)
print(coef_df)

Data Frame Output

Term	Estimate	Std Error	t value	p value
Intercept	37.285	1.878	19.86	< 2e-16 ***
Weight (wt)	-5.345	0.559	-9.56	1.3e-10 ***

R: Scatter + regression line

model <- lm(mpg ~ wt, data = mtcars)
coef(model)
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "#64748b",
     xlab = "Weight (1000 lbs)", ylab = "MPG", main = "MPG vs Weight")
abline(model, col = "#ef4444", lwd = 2)

Output (verified)

(Intercept)          wt 
  37.285126   -5.344472

Solved Examples

Example 1 Apply the concept of Simple and Multiple Regression to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
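One way to work this example with two approaches is sketched below: fit the model with lm(), then reproduce the same coefficients by solving the normal equations by hand. The choice of predictors (wt and hp) and the names fit_lm, beta_lm, X, y and beta_ne are assumptions made for illustration. The hand-rolled route is for understanding only; lm() itself uses a more numerically stable QR decomposition.

```r
# Always inspect the data before fitting
str(mtcars[, c("mpg", "wt", "hp")])
head(mtcars[, c("mpg", "wt", "hp")])

# Approach 1: lm() with a formula (multiple regression on mtcars)
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
beta_lm <- coef(fit_lm)

# Approach 2: solve the normal equations (X'X) b = X'y by hand
X <- cbind("(Intercept)" = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix
y <- mtcars$mpg
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))

# Both approaches should agree up to floating-point error
beta_lm
beta_ne
all.equal(unname(beta_lm), unname(beta_ne))
```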

Self-Assessment (2 questions)

Q1. Which function fits a linear regression model in R?

lm() fits ordinary least-squares linear regression models, e.g. lm(y ~ x, data = df).

Q2. In the formula mpg ~ wt + hp, mpg is the:

The left side of ~ is the response; the right side lists the predictors.

Concept 2

Model Diagnostics

Always check the four default diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (which overlays Cook's distance contours).

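model <- lm(mpg ~ wt + hp + cyl, data = mtcars)  # vif() needs at least two predictors
par(mfrow = c(2, 2)); plot(model)   # four diagnostic plots in a 2x2 grid
par(mfrow = c(1, 1))                # reset the plotting layout
library(car); vif(model)            # check multicollinearity (VIF > 5 is a common warning sign)
residuals(model)                    # raw residuals
rstandard(model)                    # standardised residuals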

Solved Examples

Example 1 Apply the concept of Model Diagnostics to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
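A possible worked version with two numeric approaches follows, as a sketch rather than a definitive recipe. It assumes the mtcars multiple regression used earlier. The 4/n cutoff for Cook's distance is a common rule of thumb, not a strict threshold, and the names cooks, influential, predictors and vif_manual are illustrative. The manual VIF uses the definition VIF_j = 1 / (1 - R2_j), so it does not need the car package.

```r
# Multiple regression on mtcars (same predictors as the earlier example)
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Approach 1: numeric influence check with Cook's distance
cooks <- cooks.distance(model)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]  # rule-of-thumb cutoff (assumption)
influential

# Approach 2: VIF computed by hand
# R2_j comes from regressing predictor j on the remaining predictors
predictors <- c("wt", "hp", "cyl")
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
vif_manual
```

If the car package is installed, car::vif(model) should report the same values for these numeric predictors.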

Self-Assessment (2 questions)

Q1. R-squared measures:

R-squared is the fraction of the response's variance explained by the model (0-1).

Q2. A residuals-vs-fitted plot is used to check:

Patternless residuals support the linearity and constant-variance assumptions.
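The R-squared definition in Q1 can be verified by hand as 1 - SS_res / SS_tot. The snippet below is a small sketch on the mtcars weight model; ss_res, ss_tot and r2_manual are illustrative names.

```r
model <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(model)^2)                  # variation left unexplained by the model
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in the response
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(model)$r.squared   # should match r2_manual
```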

Logistic Regression

Concept 1

Binary Classification with glm()

For binary outcomes, use glm() with family = binomial. Each coefficient is the change in the log-odds of the outcome for a one-unit increase in its predictor; exponentiate it to get an odds ratio.

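# df is assumed to be a data frame with a binary column pass (0/1 or a two-level factor)
# and numeric columns score and attendance
log_model <- glm(pass ~ score + attendance, data = df, family = binomial)
summary(log_model)
exp(coef(log_model))                   # odds ratios
predict(log_model, type = 'response')  # predicted probabilities (0-1)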

Solved Examples

Example 1 Apply the concept of Binary Classification with glm() to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
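Since the pass/score data frame above is not supplied, a runnable version can be sketched on the built-in mtcars data instead, treating am (0 = automatic, 1 = manual) as the binary outcome. The choice of wt as predictor, the two example weights, and the names log_odds, odds_ratios, new_cars and probs are assumptions made for illustration.

```r
# am is 0 = automatic, 1 = manual; wt is weight in 1000 lbs
log_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Approach 1: read the coefficient on the log-odds scale, then as an odds ratio
log_odds <- coef(log_model)
odds_ratios <- exp(log_odds)
odds_ratios["wt"]   # below 1: heavier cars have lower odds of a manual gearbox

# Approach 2: compare predicted probabilities for two hypothetical weights
new_cars <- data.frame(wt = c(2.5, 3.5))
probs <- predict(log_model, newdata = new_cars, type = "response")
probs
```

Odds ratios summarise the effect in one number, while predicted probabilities are usually easier to explain to a non-technical audience.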

Self-Assessment (2 questions)

Q1. Logistic regression is fitted with glm() and which family?

glm(y ~ x, family = binomial) fits logistic regression for a binary outcome.

Q2. Logistic regression predicts:

It outputs probabilities (0-1) that are then thresholded into classes.
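The thresholding step in Q2 might look like the sketch below. It reuses the built-in mtcars data with am as the outcome; the 0.5 cutoff is a conventional default and an assumption here, and pred_class, conf_mat and accuracy are illustrative names.

```r
log_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability that each car has a manual gearbox (am = 1)
probs <- predict(log_model, type = "response")

# Threshold at 0.5 to turn probabilities into class labels
pred_class <- ifelse(probs > 0.5, 1, 0)

# Confusion table and accuracy on the training data
conf_mat <- table(Predicted = pred_class, Actual = mtcars$am)
conf_mat
accuracy <- mean(pred_class == mtcars$am)
accuracy
```

Accuracy measured on the training data tends to be optimistic, and a different cutoff may suit problems where one type of error is more costly.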
