Statistical Analysis in R

Intermediate | 12 hrs | 3 Concepts

Your Learning Map

📌 You already know

You can summarise and visualise a dataset.

🎯 You'll learn here

Descriptive statistics, and inferential tests — t-tests and ANOVA — to compare groups.

🌍 Where it's used

Deciding whether a difference is real (A/B tests, trials, experiments) rests on these tests.

🔗 Unlocks next

Leads into Regression, which models relationships rather than just comparing groups.

Descriptive Statistics

Concept 1

Summary Statistics

Base R has built-in functions for the most common descriptive statistics (mean, median, standard deviation, variance, quantiles). The psych package adds skewness, kurtosis, and more through describe().

x <- c(72,85,90,78,92,88,65,95,80,76)
mean(x); median(x); sd(x); var(x)
quantile(x, probs=c(0.25,0.5,0.75))
IQR(x)
library(psych); describe(x)  # comprehensive summary

# Summary stats on iris dataset
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(
    n          = n(),
    mean_len   = round(mean(Sepal.Length), 2),
    sd_len     = round(sd(Sepal.Length), 2),
    median_len = median(Sepal.Length),
    min_len    = min(Sepal.Length),
    max_len    = max(Sepal.Length)
  )

Data Frame Output

Species	n	mean_len	sd_len	median_len	min_len	max_len
setosa	50	5.01	0.35	5.0	4.3	5.8
versicolor	50	5.94	0.52	5.9	4.9	7.0
virginica	50	6.59	0.64	6.5	4.9	7.9

# Histogram of Sepal Length distribution by species
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  # position = "identity" overlays the three species instead of stacking them
  geom_histogram(binwidth = 0.3, alpha = 0.7, position = "identity") +
  labs(title = "Sepal Length Distribution by Species",
       x = "Sepal Length (cm)", y = "Frequency") +
  theme_minimal()

R: Histogram of MPG

summary(mtcars$mpg)
hist(mtcars$mpg, col = "#3b82f6", breaks = 8,
     main = "Distribution of MPG", xlab = "Miles per gallon")

Output

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90

Solved Examples

Example 1 Apply the concept of Summary Statistics to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always inspect your data first with str() and head().
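
One way to work this example, as a minimal sketch that assumes the built-in mtcars dataset stands in for the sample data. The variable names mpg, s1 and s2 are illustrative only. The first approach uses a single summary() call; the second builds the same six values from individual functions.

```r
# Sample data: the mpg column of the built-in mtcars dataset
mpg <- mtcars$mpg

# Approach 1: one call to summary() for min, quartiles, median, mean, max
s1 <- summary(mpg)
print(s1)

# Approach 2: assemble the same six values from individual functions
s2 <- c(
  Min    = min(mpg),
  Q1     = unname(quantile(mpg, 0.25)),
  Median = median(mpg),
  Mean   = mean(mpg),
  Q3     = unname(quantile(mpg, 0.75)),
  Max    = max(mpg)
)
print(round(s2, 2))
```

The second approach is longer, but it makes it easy to add or drop statistics, for example sd(mpg) or IQR(mpg).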

Self-Assessment (2 questions)

Q1. Which function returns the min, quartiles, median, mean and max of a numeric vector?

summary() gives the five-number summary plus the mean for a numeric vector.

Q2. The (sample) standard deviation in R is computed with:

sd() returns the standard deviation; var() returns the variance (its square).
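
The relationship between sd() and var() can be checked by hand. The sketch below reuses the vector x from the Summary Statistics code; var_manual and sd_manual are illustrative names. Note the n - 1 denominator, which is what makes these the sample (not population) statistics.

```r
x <- c(72, 85, 90, 78, 92, 88, 65, 95, 80, 76)
n <- length(x)

# Sample variance by hand: sum of squared deviations divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(x))  # TRUE
all.equal(sd_manual, sd(x))    # TRUE
```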

Hypothesis Testing

Concept 1

t-tests

t.test() performs one-sample tests (against a hypothesised mean mu), two-sample tests (comparing two groups, using Welch's unequal-variance version by default), and paired tests.

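# x is the vector defined earlier; group_a, group_b, before and after
# are placeholder numeric vectors to replace with your own data.
t.test(x, mu = 80)                    # one-sample: H0 is that the mean equals 80
t.test(group_a, group_b)              # two-sample (Welch by default)
t.test(before, after, paired = TRUE)  # paired: both vectors must have equal length
# Always report: t-statistic, df, p-value, 95% CI, effect size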

# Two-sample t-test: is setosa Sepal.Length different from versicolor?
t.test(
  iris$Sepal.Length[iris$Species == "setosa"],
  iris$Sepal.Length[iris$Species == "versicolor"]
)

Output

Welch Two Sample t-test

data:  iris$Sepal.Length[iris$Species == "setosa"] and iris$Sepal.Length[iris$Species == "versicolor"]
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1057074 -0.7542926
sample estimates:
mean of x mean of y 
    5.006     5.936
    
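Conclusion: p < 0.001, so there is strong evidence that mean sepal length differs between setosa and versicolor. The 95% confidence interval puts the difference (setosa minus versicolor) at roughly -1.11 to -0.75 cm.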
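
The reporting checklist above includes an effect size, which t.test() does not return. A minimal sketch of one common choice, Cohen's d, computed by hand for the same two species. The names pooled_sd and cohens_d are illustrative, and the pooled-SD formula used here assumes equal group sizes (true for iris, 50 per species).

```r
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Pooled SD for two equal-sized groups: root of the average variance
pooled_sd <- sqrt((var(setosa) + var(versicolor)) / 2)

# Cohen's d: the mean difference expressed in pooled-SD units
cohens_d <- (mean(setosa) - mean(versicolor)) / pooled_sd
round(cohens_d, 2)
```

By the usual rule of thumb, an absolute value of d above 0.8 is considered a large effect.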

Solved Examples

Example 1 Apply the concept of t-tests to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always inspect your data first with str() and head().
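
One way to work this example, as a sketch that assumes Sepal.Width for two iris species is the sample data. The names two_species, res_vec and res_formula are illustrative. Both approaches run the same Welch test; they differ only in how the groups are passed to t.test().

```r
# Keep two species and drop the unused third factor level
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Approach 1: vector interface, one numeric vector per group
res_vec <- t.test(
  two_species$Sepal.Width[two_species$Species == "setosa"],
  two_species$Sepal.Width[two_species$Species == "versicolor"]
)

# Approach 2: formula interface, response ~ two-level grouping factor
res_formula <- t.test(Sepal.Width ~ Species, data = two_species)

# Pull the reportable pieces out of the returned "htest" object
c(t  = unname(res_formula$statistic),
  df = unname(res_formula$parameter),
  p  = res_formula$p.value)
res_formula$conf.int
```

The formula interface is usually more convenient when the data are already in a data frame with a grouping column.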

Self-Assessment (2 questions)

Q1. A two-sample t-test is used to:

A two-sample t-test assesses whether two group means differ significantly.

Q2. If a t-test gives p = 0.002 at the 0.05 level, you should:

p (0.002) < 0.05, so the difference is statistically significant - reject the null.

Concept 2

ANOVA

aov() runs analysis of variance. Use TukeyHSD() for post-hoc pairwise comparisons.

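# df is a placeholder data frame with a numeric score and a factor method
model <- aov(score ~ method, data = df)
summary(model)       # F-statistic and p-value
TukeyHSD(model)      # which pairs differ significantly?
# Check assumptions: shapiro.test(residuals(model)) for normality,
# bartlett.test(score ~ method, data = df) for equal variances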

# One-way ANOVA: does mean Sepal.Length differ among the three species?
model_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(model_aov)

Output

             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F = 119.3, p < 2e-16: at least one species mean differs significantly.
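
The assumption checks mentioned earlier can be applied to this model. A minimal sketch, refitting the model so the block stands alone; the names shapiro_res, bartlett_res and welch_res are illustrative. Each test's null hypothesis is that the assumption holds, so a small p-value signals a violation.

```r
model_aov <- aov(Sepal.Length ~ Species, data = iris)

# Normality of the residuals (null hypothesis: residuals are normal)
shapiro_res <- shapiro.test(residuals(model_aov))
shapiro_res

# Homogeneity of variance across groups (null hypothesis: equal variances)
bartlett_res <- bartlett.test(Sepal.Length ~ Species, data = iris)
bartlett_res

# If variances look unequal, Welch's one-way test does not assume equality
welch_res <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)
welch_res
```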

# Post-hoc Tukey HSD — which pairs differ?
TukeyHSD(model_aov)

Data Frame Output

Comparison	Mean Diff	Lower CI	Upper CI	Adj. p
versicolor-setosa	0.930	0.686	1.174	< 0.001
virginica-setosa	1.582	1.338	1.826	< 0.001
virginica-versicolor	0.652	0.408	0.896	< 0.001
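
A table like the one above can be pulled straight from the TukeyHSD() result. A minimal sketch, refitting the model so the block stands alone; tukey and tukey_df are illustrative names. The result is a list with one matrix per factor in the model, named after the factor.

```r
model_aov <- aov(Sepal.Length ~ Species, data = iris)
tukey <- TukeyHSD(model_aov, conf.level = 0.95)

# One matrix per factor; here the only factor is "Species"
tukey_df <- as.data.frame(tukey$Species)

# Columns: diff, lwr, upr, and "p adj" (the adjusted p-value)
print(round(tukey_df, 3))

# An interval that excludes 0 means that pair differs at the chosen level
tukey_df$lwr > 0 | tukey_df$upr < 0
```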

Solved Examples

Example 1 Apply the concept of ANOVA to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always inspect your data first with str() and head().
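
One way to work this example, as a sketch that assumes the built-in mtcars dataset is the sample data, with the number of cylinders as the grouping factor. The names mt_cyl, fit_aov, fit_lm, f_aov and f_lm are illustrative. The two approaches fit the same model and should report the same F statistic.

```r
# Sample data: mtcars, with cyl converted from numeric to a factor
mt_cyl <- transform(mtcars, cyl = factor(cyl))

# Approach 1: aov()
fit_aov <- aov(mpg ~ cyl, data = mt_cyl)
summary(fit_aov)

# Approach 2: lm() followed by anova()
fit_lm <- lm(mpg ~ cyl, data = mt_cyl)
anova(fit_lm)

# Extract the F statistic from each result and compare
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- anova(fit_lm)[["F value"]][1]
all.equal(f_aov, f_lm)
```

Converting cyl to a factor matters: left as a number, it would be fitted as a single slope rather than as three groups.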

Self-Assessment (2 questions)

Q1. ANOVA compares the means of:

ANOVA tests whether three or more group means differ (a t-test handles just two).

Q2. In R, a one-way ANOVA can be fitted with:

aov() (or lm() followed by anova()) fits an analysis-of-variance model.
