Statistical Analysis in R
Descriptive Statistics
Summary Statistics
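Base R has built-in functions for the most common descriptive statistics (mean, median, standard deviation, variance, quantiles). It has no built-in skewness or kurtosis; the psych package adds those and more through describe().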
x <- c(72,85,90,78,92,88,65,95,80,76)
mean(x); median(x); sd(x); var(x)
quantile(x, probs=c(0.25,0.5,0.75))
IQR(x)
library(psych); describe(x) # comprehensive summary
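To make concrete what skewness and kurtosis measure, here is a minimal base-R sketch using the moment-based definitions. The helper names `skewness_moment` and `kurtosis_excess` are hypothetical (they are not part of base R or psych), and describe() may apply a slightly different small-sample formula, so its values can differ a little from these.

```r
# Hypothetical helpers for illustration; not part of base R or psych.

# Moment-based skewness: third central moment over (second central moment)^1.5.
skewness_moment <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over (second central moment)^2, minus 3
# so that a normal distribution scores 0.
kurtosis_excess <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

x <- c(72, 85, 90, 78, 92, 88, 65, 95, 80, 76)
skewness_moment(x)  # negative values indicate a longer left tail
kurtosis_excess(x)  # negative values indicate lighter tails than a normal
```

A perfectly symmetric vector such as 1:5 gives a skewness of 0, which is a quick sanity check for the helper.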
# Summary stats on iris dataset
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(
    n = n(),
    mean_len = round(mean(Sepal.Length), 2),
    sd_len = round(sd(Sepal.Length), 2),
    median_len = median(Sepal.Length),
    min_len = min(Sepal.Length),
    max_len = max(Sepal.Length)
  )
| Species | n | mean_len | sd_len | median_len | min_len | max_len |
|---|---|---|---|---|---|---|
| setosa | 50 | 5.01 | 0.35 | 5.0 | 4.3 | 5.8 |
| versicolor | 50 | 5.94 | 0.52 | 5.9 | 4.9 | 7.0 |
| virginica | 50 | 6.59 | 0.64 | 6.5 | 4.9 | 7.9 |
# Histogram of Sepal Length distribution by species
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.3, alpha = 0.7, position = "identity") +
  labs(title = "Sepal Length Distribution by Species",
       x = "Sepal Length (cm)", y = "Frequency") +
  theme_minimal()
Adapt the examples above to your own data, and inspect the data and results with str() and head() before relying on them.
Hypothesis Testing
t-tests
t.test() performs one-sample (test against mu), two-sample (compare two groups), and paired tests.
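# x is the vector defined earlier; group_a, group_b, before, and after are placeholder numeric vectors
t.test(x, mu = 80) # one-sample: does the mean differ from 80?
t.test(group_a, group_b) # two-sample (Welch test by default; set var.equal = TRUE for the pooled version)
t.test(before, after, paired = TRUE) # paired
# Always report: t-statistic, df, p-value, 95% CI, effect size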
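The paired call above uses placeholder vectors. As a runnable sketch, the built-in sleep dataset (extra hours of sleep for the same 10 patients under two drugs) can stand in for before/after data. The variable names `drug1`, `drug2`, `paired_res`, and `diff_res` are illustrative. The example also shows that a paired t-test is the same as a one-sample t-test on the within-pair differences.

```r
# sleep is a built-in dataset: rows are ordered by group, then by patient ID,
# so the two vectors below line up patient by patient.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired test on the two measurements per patient
paired_res <- t.test(drug1, drug2, paired = TRUE)

# Equivalent formulation: one-sample test on the within-pair differences
diff_res <- t.test(drug1 - drug2, mu = 0)

paired_res$statistic  # t-statistic
paired_res$parameter  # degrees of freedom (number of pairs minus 1)
paired_res$p.value
paired_res$conf.int   # 95% CI for the mean difference

# The two formulations give the same t-statistic
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
```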
# Two-sample t-test: is setosa Sepal.Length different from versicolor?
t.test(
iris$Sepal.Length[iris$Species == "setosa"],
iris$Sepal.Length[iris$Species == "versicolor"]
)
Welch Two Sample t-test
data: setosa vs versicolor Sepal.Length
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1057074 -0.7542926
sample estimates:
mean of x mean of y
5.006 5.936
Conclusion: p < 0.001, so there is strong evidence that the mean sepal lengths of setosa and versicolor differ.
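t.test() does not report an effect size, so it has to be computed separately. The following is a minimal sketch of Cohen's d with a pooled standard deviation, written in base R. The variable names are illustrative, and the pooled-SD form is one common convention among several.

```r
# Extract the two groups compared in the t-test
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

n1 <- length(setosa)
n2 <- length(versicolor)

# Pooled standard deviation: variances weighted by their degrees of freedom
pooled_sd <- sqrt(((n1 - 1) * var(setosa) + (n2 - 1) * var(versicolor)) /
                    (n1 + n2 - 2))

# Cohen's d: difference in means expressed in pooled-SD units
cohens_d <- (mean(setosa) - mean(versicolor)) / pooled_sd
cohens_d
```

By the usual rule of thumb, an absolute d above 0.8 counts as a large effect.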
ANOVA
aov() runs analysis of variance. Use TukeyHSD() for post-hoc pairwise comparisons.
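# df is a placeholder data frame with a numeric score column and a factor method column
model <- aov(score ~ method, data = df)
summary(model) # F-statistic and p-value
TukeyHSD(model) # which pairs differ significantly?
# Check assumptions: shapiro.test() on the residuals for normality, bartlett.test() for equal variances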
# One-way ANOVA: do all three species differ in Sepal.Length?
model_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(model_aov)
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F = 119.3, p < 2e-16: at least one species mean differs significantly.
# Post-hoc Tukey HSD — which pairs differ?
TukeyHSD(model_aov)
| Comparison | Mean Diff | Lower CI | Upper CI | p adjusted |
|---|---|---|---|---|
| versicolor-setosa | 0.930 | 0.686 | 1.174 | < 0.001 |
| virginica-setosa | 1.582 | 1.338 | 1.826 | < 0.001 |
| virginica-versicolor | 0.652 | 0.408 | 0.896 | < 0.001 |
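The assumption checks mentioned earlier can be run on the iris model as follows. This is a sketch, not a full diagnostic workflow: the object names `sw`, `bt`, and `welch` are illustrative, and using oneway.test() as a fallback when variances look unequal is a suggestion of this example.

```r
# Refit the model so the block is self-contained
model_aov <- aov(Sepal.Length ~ Species, data = iris)

# Normality: test the model residuals, not the raw response
sw <- shapiro.test(residuals(model_aov))
sw$p.value

# Homogeneity of variance across the three species
bt <- bartlett.test(Sepal.Length ~ Species, data = iris)
bt$p.value

# If the variances appear unequal, Welch's one-way test does not assume
# equal variances (var.equal = FALSE is its default)
welch <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)
welch
```

A small p-value from either check signals a departure from the corresponding assumption. With 50 observations per group, plots such as plot(model_aov) are a useful complement to the formal tests.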