Statistical Analysis in R

Intermediate | 12 hrs | 3 Concepts
M1

Descriptive Statistics

Concept 1

Summary Statistics

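Base R has built-in functions for the most common descriptive statistics: mean(), median(), sd(), var(), quantile(), and IQR(). The psych package's describe() adds skewness, kurtosis, and more in a single call.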

R
x <- c(72,85,90,78,92,88,65,95,80,76)
mean(x); median(x); sd(x); var(x)
quantile(x, probs=c(0.25,0.5,0.75))
IQR(x)
library(psych); describe(x)  # comprehensive summary
R
# Summary stats on iris dataset
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(
    n          = n(),
    mean_len   = round(mean(Sepal.Length), 2),
    sd_len     = round(sd(Sepal.Length), 2),
    median_len = median(Sepal.Length),
    min_len    = min(Sepal.Length),
    max_len    = max(Sepal.Length)
  )
Data Frame Output
Species      n    mean_len  sd_len  median_len  min_len  max_len
setosa       50   5.01      0.35    5.0         4.3      5.8
versicolor   50   5.94      0.52    5.9         4.9      7.0
virginica    50   6.59      0.64    6.5         4.9      7.9
R
# Histogram of Sepal Length distribution by species
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.3, alpha = 0.7, position = "identity") +
  labs(title = "Sepal Length Distribution by Species",
       x = "Sepal Length (cm)", y = "Frequency") +
  theme_minimal()
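Chart Output (figure not reproduced): overlaid histograms of Sepal Length (cm) against Frequency, one fill colour per species.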
Solved Examples
Example 1 Apply the concept of Summary Statistics to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
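One possible way to work this example, using base R only, is sketched below. It summarises Petal.Length by species in two ways: tapply() for one statistic at a time, and aggregate() with the formula interface for several statistics at once. The choice of Petal.Length and all object names are illustrative assumptions, not part of the original exercise.

```r
# Worked sketch: petal length by species, two base-R approaches.
# Petal.Length is an illustrative choice; any numeric column would do.
data(iris)
str(iris)    # confirm column types before summarising
head(iris)   # eyeball the first rows

# Approach 1: tapply() applies one function per group
mean_by_species <- tapply(iris$Petal.Length, iris$Species, mean)
sd_by_species   <- tapply(iris$Petal.Length, iris$Species, sd)
print(round(cbind(mean = mean_by_species, sd = sd_by_species), 2))

# Approach 2: aggregate() with a formula, returning several statistics at once
stats_by_species <- aggregate(
  Petal.Length ~ Species, data = iris,
  FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v), median = median(v))
)
print(stats_by_species)
```

Both approaches should agree on the group means; the dplyr pipeline shown earlier is a third option when the tidyverse is available.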

Self-Assessment (2 questions)
Q1. What is the primary purpose of summary statistics?
Q2. Which R package is most relevant for this topic?
M2

Hypothesis Testing

Concept 1

t-tests

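t.test() performs one-sample tests (a mean against a reference value mu), two-sample tests (comparing two group means), and paired tests. The two-sample version defaults to Welch's test, which does not assume equal variances.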

R
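# Illustrative inputs (hypothetical split of the scores in x from Summary Statistics)
group_a <- x[1:5]; group_b <- x[6:10]   # treated as two independent groups
before  <- x[1:5]; after   <- x[6:10]   # treated as paired measurements
t.test(x, mu=80)              # one-sample: is mean = 80?
t.test(group_a, group_b)       # two-sample (Welch by default)
t.test(before, after, paired=TRUE)  # paired
# Always report: t-statistic, df, p-value, 95% CI, and an effect size (not returned by t.test())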
R
# Two-sample t-test: is setosa Sepal.Length different from versicolor?
t.test(
  iris$Sepal.Length[iris$Species == "setosa"],
  iris$Sepal.Length[iris$Species == "versicolor"]
)
Output
Welch Two Sample t-test

data:  setosa vs versicolor Sepal.Length
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1057074 -0.7542926
sample estimates:
mean of x mean of y 
    5.006     5.936

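Conclusion: p < 0.001, so there is strong evidence that mean sepal length differs between the two species. The 95% CI places the setosa mean roughly 0.75 to 1.11 cm below the versicolor mean.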
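t.test() does not report an effect size, so it has to be computed separately. The sketch below computes Cohen's d by hand with a pooled standard deviation, which is one common convention among several. The object names are assumptions made for illustration.

```r
# Sketch: Cohen's d for setosa vs versicolor sepal length (base R only).
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

n1 <- length(setosa)
n2 <- length(versicolor)

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n1 - 1) * var(setosa) + (n2 - 1) * var(versicolor)) /
                    (n1 + n2 - 2))

# Standardised mean difference (negative: setosa is shorter)
cohens_d <- (mean(setosa) - mean(versicolor)) / pooled_sd

# Collect everything worth reporting in one named vector
res <- t.test(setosa, versicolor)
report <- c(
  t       = unname(res$statistic),
  df      = unname(res$parameter),
  p       = res$p.value,
  ci_low  = res$conf.int[1],
  ci_high = res$conf.int[2],
  d       = cohens_d
)
print(signif(report, 4))
```

By Cohen's conventional benchmarks, an absolute d of 0.8 or more counts as a large effect.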
Solved Examples
Example 1 Apply the concept of t-tests to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
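One way to show two approaches is to run the same Welch test through both interfaces of t.test(): passing two numeric vectors, and passing a formula with a data frame. The sketch below compares Sepal.Width for versicolor and virginica. The variable choice and object names are illustrative assumptions.

```r
# Worked sketch: the same two-sample test via two t.test() interfaces.
# Keep only two species and drop the unused factor level.
two_species <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))
str(two_species)    # Species should now have exactly 2 levels
head(two_species)

# Approach 1: pass the two groups as numeric vectors
res_vec <- t.test(
  two_species$Sepal.Width[two_species$Species == "versicolor"],
  two_species$Sepal.Width[two_species$Species == "virginica"]
)

# Approach 2: formula interface (response ~ two-level grouping factor)
res_formula <- t.test(Sepal.Width ~ Species, data = two_species)

print(res_vec)
print(res_formula)
```

The two calls should give the same t-statistic, degrees of freedom, and p-value; the formula form is usually more convenient when the data are already in a long-format data frame.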

Self-Assessment (2 questions)
Q1. What is the primary purpose of t-tests?
Q2. Which R package is most relevant for this topic?
Concept 2

ANOVA

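aov() fits an analysis of variance model, testing whether the group means differ. When the overall F-test is significant, use TukeyHSD() for post-hoc pairwise comparisons with adjusted p-values.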

R
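# df is a placeholder data frame with a numeric score and a factor method
model <- aov(score ~ method, data=df)
summary(model)       # F-statistic and p-value
TukeyHSD(model)      # which pairs differ significantly?
# Check assumptions: shapiro.test(residuals(model)) for normality,
# bartlett.test(score ~ method, data=df) for equal variance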
R
# One-way ANOVA: does mean Sepal.Length differ among the three species?
model_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(model_aov)
Output
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F = 119.3, p < 2e-16: at least one species mean differs from the others. The F-test alone does not say which pairs differ; that is what the post-hoc test is for.
R
# Post-hoc Tukey HSD — which pairs differ?
TukeyHSD(model_aov)
Data Frame Output
Comparison             Mean Diff  Lower CI  Upper CI  p adjusted
versicolor-setosa      0.930      0.686     1.174     0.000
virginica-setosa       1.582      1.338     1.826     0.000
virginica-versicolor   0.652      0.408     0.896     0.000
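The ANOVA result is only as trustworthy as its assumptions, so it is worth checking them on the fitted model. The sketch below runs shapiro.test() on the residuals and bartlett.test() on the groups, then fits Welch's one-way test with oneway.test() as a fallback for unequal variances. Object names are assumptions for illustration.

```r
# Sketch: assumption checks for the one-way ANOVA on iris.
model_aov <- aov(Sepal.Length ~ Species, data = iris)

# Normality: test the model residuals, not the raw response
shapiro_res <- shapiro.test(residuals(model_aov))
print(shapiro_res)

# Homogeneity of variance across the three species
bartlett_res <- bartlett.test(Sepal.Length ~ Species, data = iris)
print(bartlett_res)

# If equal variances look doubtful, Welch's one-way test drops that assumption
welch_res <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)
print(welch_res)
```

A small p-value from either check signals a departure from the assumption. With 50 observations per group, mild non-normality usually matters less than unequal variances.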
Solved Examples
Example 1 Apply the concept of ANOVA to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
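A possible pair of approaches is to fit the same one-way model with aov() and with lm() followed by anova(). Both are base R and should give the same F-statistic, because aov() fits a linear model underneath. The sketch below uses Sepal.Width as an illustrative response; object names are assumptions.

```r
# Worked sketch: one-way ANOVA on Sepal.Width via two routes.
str(iris)    # Species must be a factor for a one-way ANOVA
head(iris)

# Approach 1: aov(), then pull the F-statistic from the summary table
fit_aov <- aov(Sepal.Width ~ Species, data = iris)
f_aov   <- summary(fit_aov)[[1]][["F value"]][1]

# Approach 2: lm() followed by anova() on the fitted model
fit_lm <- lm(Sepal.Width ~ Species, data = iris)
f_lm   <- anova(fit_lm)[["F value"]][1]

print(c(aov = f_aov, lm = f_lm))

# Post-hoc comparisons need the aov object
print(TukeyHSD(fit_aov))
```

The lm() route is convenient when the model will later be extended with covariates; TukeyHSD() expects an aov fit.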

Self-Assessment (2 questions)
Q1. What is the primary purpose of ANOVA?
Q2. Which R package is most relevant for this topic?