Data Wrangling with dplyr

Intermediate 15 hrs 4 Concepts

Your Learning Map

📌 You already know

You can load a dataset into a data frame.

🎯 You'll learn here

The five dplyr verbs (filter, select, mutate, arrange, summarise), plus group_by and the pipe, written either as magrittr's %>% or as base R's native |>.

🌍 Where it's used

The daily bread of analytics: turning a raw table into the one number or chart a decision needs.

🔗 Unlocks next

Clean, grouped data flows straight into ggplot2 for visualisation.

The Six Core Verbs

Concept 1

filter() and select()

filter() keeps rows matching a condition. select() chooses columns.

library(dplyr)

students <- tibble(
  name    = c('Aarav','Kavya','Rohan','Ananya','Vivaan'),
  subject = c('Math','Math','Science','Math','Science'),
  score   = c(92, 88, 75, 95, 82),
  year    = c(11, 11, 10, 11, 10)
)

# filter() — multiple conditions use & (AND) or | (OR)
students |> filter(score >= 90)
students |> filter(subject == 'Math', year == 11)   # AND
students |> filter(score > 80 | year == 10)          # OR
students |> filter(name %in% c('Aarav','Kavya'))     # %in%
students |> filter(!is.na(score))                    # not NA

# select() — choose, drop, or rename columns
students |> select(name, score)
students |> select(-year)                  # drop year
students |> select(starts_with('s'))       # helper: starts_with
students |> select(where(is.numeric))      # all numeric cols
students |> select(student=name, marks=score)  # rename while selecting

# filter() and select() on a built-in dataset
library(dplyr)
library(ggplot2)   # provides the mpg dataset
high_mpg <- filter(mpg, hwy > 30, cyl == 4)
select(high_mpg, manufacturer, model, hwy, cyl)

Data Frame Output (illustrative excerpt, not the complete result)

manufacturer	model	hwy	cyl
honda	civic	36	4
honda	civic	36	4
toyota	corolla	35	4
volkswagen	jetta	34	4
toyota	corolla	34	4
volkswagen	new beetle	44	4

Solved Examples

Example 1 From students data, select only Math students in year 11 and show only their name and score.

students |>
  filter(subject == 'Math', year == 11) |>
  select(name, score)
# Aarav  92
# Kavya  88
# Ananya 95

Example 2 Keep only rows where subject is either 'Math' or 'Physics'.

df |> filter(subject %in% c('Math', 'Physics'))
# Equivalent to:
df |> filter(subject == 'Math' | subject == 'Physics')
# %in% is cleaner for 3+ values
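
A quick way to check that the two forms agree is to run both on a small table. The sketch below assumes a hypothetical df, since the example leaves df undefined; the names and subjects are made up for illustration.

```r
library(dplyr)

# Hypothetical data: 'Physics' is included so both conditions can match
df <- tibble(
  name    = c('Aarav', 'Kavya', 'Rohan', 'Ananya'),
  subject = c('Math', 'Physics', 'Science', 'Math')
)

with_in <- df |> filter(subject %in% c('Math', 'Physics'))
with_or <- df |> filter(subject == 'Math' | subject == 'Physics')

# Both keep Aarav, Kavya and Ananya; Rohan (Science) is dropped
all.equal(with_in, with_or)
```

Avoid writing subject == c('Math', 'Physics'): that compares element by element with recycling, which is rarely what you want.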

Self-Assessment (3 questions)

Q1. Which dplyr verb removes columns?

select() with a minus sign drops columns: select(-col_to_drop). For multiple: select(-c(col1, col2)).

Q2. What does filter(df, !is.na(score)) do?

is.na() returns TRUE for NA values. The ! negates it, so we keep rows where score is NOT NA.
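
To see this on data, here is a minimal sketch with a hypothetical scores table containing one missing value. It also shows a related behaviour: a comparison against NA evaluates to NA, and filter() drops those rows as well.

```r
library(dplyr)

# Hypothetical scores with one missing value
scores <- tibble(
  name  = c('Aarav', 'Kavya', 'Rohan'),
  score = c(92, NA, 75)
)

complete <- scores |> filter(!is.na(score))   # drops Kavya
missing  <- scores |> filter(is.na(score))    # keeps only Kavya

# score > 80 is NA for Kavya, so her row is dropped here too
above_80 <- scores |> filter(score > 80)      # Aarav only
```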

Q3. Which select() helper picks columns that match a pattern?

matches() takes a regular expression. starts_with(), ends_with(), and contains() match literal text in the column name, while where() selects by a predicate function such as is.numeric rather than by name.
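
The helpers are easiest to compare side by side. This sketch uses a hypothetical marks table whose column names were chosen so that each helper picks a different set.

```r
library(dplyr)

# Hypothetical column names, chosen so each helper matches something different
marks <- tibble(
  student_id = 1:2,
  math_score = c(92, 88),
  sci_score  = c(75, 95),
  math_grade = c('A', 'B')
)

names(select(marks, starts_with('math')))             # "math_score" "math_grade"
names(select(marks, ends_with('score')))              # "math_score" "sci_score"
names(select(marks, contains('_')))                   # all four columns
names(select(marks, matches('^(math|sci)_score$')))   # "math_score" "sci_score"
names(select(marks, where(is.character)))             # "math_grade"
```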

Concept 2

mutate(), arrange(), summarise(), group_by()

mutate() adds or transforms columns. arrange() sorts rows. summarise() collapses rows into aggregate values. group_by() makes the verbs that follow operate per group.

# mutate() — add or transform columns
students |> mutate(
  grade     = if_else(score >= 90, 'A', if_else(score >= 80, 'B', 'C')),
  score_pct = score / 100,
  rank      = rank(desc(score))   # rank within mutate
)

# arrange() — sort ascending; desc() for descending
students |> arrange(desc(score), name)   # score descending, name alphabetical

# summarise() — aggregate to one row
students |> summarise(
  n          = n(),
  mean_score = mean(score),
  max_score  = max(score),
  pass_rate  = mean(score >= 80)
)

# group_by() + summarise() — aggregate per group
students |>
  group_by(subject, year) |>
  summarise(
    mean_score = mean(score),
    n          = n(),
    .groups    = 'drop'   # return an ungrouped result and silence the regrouping message
  )

# group_by() + mutate() — per-group transform (keeps all rows!)
students |>
  group_by(subject) |>
  mutate(subject_rank = rank(desc(score)))

# group_by + summarise: mean MPG per car class
library(ggplot2)   # provides the mpg dataset
mpg %>%
  group_by(class) %>%
  summarise(
    avg_hwy  = round(mean(hwy), 1),
    avg_cty  = round(mean(cty), 1),
    count    = n()
  ) %>%
  arrange(desc(avg_hwy))

Data Frame Output

class	avg_hwy	avg_cty	count
compact	28.3	20.1	47
subcompact	28.1	20.4	35
midsize	27.3	18.8	41
2seater	24.8	15.4	5
minivan	22.4	15.8	11
suv	18.1	13.5	62
pickup	16.9	13.0	33

# mutate() — add derived columns
mpg %>%
  mutate(
    efficiency_ratio = round(hwy / cty, 2),
    size_class       = if_else(displ > 3.5, "Large", "Small")
  ) %>%
  select(model, displ, cty, hwy, efficiency_ratio, size_class) %>%
  head(6)

Base R: group means with aggregate()

aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

Output

  cyl      mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000

Solved Examples

Example 1 For each subject, calculate the number of students, mean score, and percentage scoring >= 80.

students |>
  group_by(subject) |>
  summarise(
    n         = n(),
    mean_score = round(mean(score), 1),
    pass_rate  = paste0(round(mean(score >= 80)*100), '%'),
    .groups   = 'drop'
  )

Example 2 Add a column showing each student's rank within their subject, ordered by score descending.

students |>
  group_by(subject) |>
  mutate(rank = rank(desc(score))) |>
  arrange(subject, rank)
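
One detail worth knowing when ranking: base R's rank() gives tied values the average of their positions, so ranks can be fractional. The sketch below uses a hypothetical table with a tie and compares rank() with dplyr's min_rank() and dense_rank(); the column names r_avg, r_min and r_dense are illustrative only.

```r
library(dplyr)

# Hypothetical scores with a tie at the top
tied <- tibble(name = c('A', 'B', 'C'), score = c(90, 90, 80))

ranked <- tied |> mutate(
  r_avg   = rank(desc(score)),         # 1.5 1.5 3.0  (ties share the average)
  r_min   = min_rank(desc(score)),     # 1 1 3        (ties share the lowest rank)
  r_dense = dense_rank(desc(score))    # 1 1 2        (no gaps after ties)
)
ranked
```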

Self-Assessment (2 questions)

Q1. What is the difference between summarise() and mutate() after group_by()?

After group_by, summarise() collapses each group to one row. mutate() keeps all rows but computes values group-by-group.
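
The difference shows up directly in the row counts. A minimal sketch, reusing the students data from this lesson (the diff_from_mean column is an illustrative name):

```r
library(dplyr)

students <- tibble(
  name    = c('Aarav', 'Kavya', 'Rohan', 'Ananya', 'Vivaan'),
  subject = c('Math', 'Math', 'Science', 'Math', 'Science'),
  score   = c(92, 88, 75, 95, 82)
)

by_subject <- students |> group_by(subject)

# summarise(): one row per subject
collapsed <- by_subject |> summarise(mean_score = mean(score))
nrow(collapsed)   # 2

# mutate(): every student kept, each compared with their own subject's mean
annotated <- by_subject |>
  mutate(diff_from_mean = score - mean(score)) |>
  ungroup()
nrow(annotated)   # 5
```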

Q2. Which argument should you add to summarise() to suppress the grouping message?

Adding .groups='drop' to summarise() returns an ungrouped result and silences the "summarise() has grouped output by ..." message, which appears when you group by more than one variable.
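
You can inspect what grouping is left on a result with group_vars(). A minimal sketch, using a cut-down copy of the students data:

```r
library(dplyr)

students <- tibble(
  subject = c('Math', 'Math', 'Science', 'Math', 'Science'),
  year    = c(11, 11, 10, 11, 10),
  score   = c(92, 88, 75, 95, 82)
)

# Default behaviour: only the last grouping variable (year) is removed,
# and dplyr prints a message saying the output is still grouped
still_grouped <- students |>
  group_by(subject, year) |>
  summarise(mean_score = mean(score))
group_vars(still_grouped)   # "subject"

# With .groups = 'drop' the result carries no grouping at all
ungrouped <- students |>
  group_by(subject, year) |>
  summarise(mean_score = mean(score), .groups = 'drop')
group_vars(ungrouped)       # character(0)
```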

Joins and Advanced Operations

Concept 3

Joining Datasets

dplyr provides SQL-style joins to combine two tables by matching key columns.

students <- tibble(id=1:4, name=c('Aarav','Kavya','Rohan','Ananya'))
enrolled <- tibble(id=c(1,2,4,5), course=c('R','Python','SQL','Julia'))

# inner_join — only rows with match in BOTH tables
inner_join(students, enrolled, by='id')
# Aarav-R, Kavya-Python, Ananya-SQL (Rohan and id 5 are dropped)

# left_join — ALL rows from left; NA if no match on right
left_join(students, enrolled, by='id')
# Aarav-R, Kavya-Python, Rohan-NA, Ananya-SQL

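# right_join — ALL rows from right; NA if no match on left
right_join(students, enrolled, by='id')
# Aarav-R, Kavya-Python, Ananya-SQL, NA-Julia

# full_join — ALL rows from both
full_join(students, enrolled, by='id')
# Aarav-R, Kavya-Python, Rohan-NA, Ananya-SQL, NA-Julia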

# anti_join — rows in left NOT in right (non-enrolled students)
anti_join(students, enrolled, by='id')
# Rohan only

# semi_join — rows in left that HAVE a match in right (no extra cols)
semi_join(students, enrolled, by='id')
# Aarav, Kavya, Ananya (no extra columns from enrolled)

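# Different column names: map the left key to the right key in by
# (enrolled has no student_id column, so rename its key first for this illustration)
enrolled_sid <- rename(enrolled, student_id = id)
left_join(students, enrolled_sid, by=c('id'='student_id'))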

# inner_join — combine orders with customer info
orders     <- data.frame(id=c(1,2,3,4), cust_id=c(10,11,10,12), amount=c(250,180,320,95))
customers  <- data.frame(cust_id=c(10,11,13), name=c("Aarav","Kavya","Rohan"), city=c("Mumbai","Delhi","Pune"))
inner_join(orders, customers, by="cust_id")

Data Frame Output

id	cust_id	amount	name	city
1	10	250	Aarav	Mumbai
2	11	180	Kavya	Delhi
3	10	320	Aarav	Mumbai

Solved Examples

Example 1 You have a products table and sales table. Find products that have never been sold.

anti_join(products, sales, by='product_id')
# Returns products with no matching product_id in sales
# Perfect for finding 'dead stock'
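
To make the example runnable, here is a sketch with two hypothetical tables. Only the product_id key comes from the example; the other columns and all values are made up.

```r
library(dplyr)

# Hypothetical tables for illustration
products <- tibble(
  product_id = c(101, 102, 103),
  title      = c('Notebook', 'Pen', 'Stapler')
)
sales <- tibble(
  sale_id    = 1:3,
  product_id = c(101, 101, 103)
)

# Product 102 never appears in sales, so it is the only row returned
never_sold <- anti_join(products, sales, by = 'product_id')
never_sold$title   # "Pen"
```

The result has only the columns of products; nothing from sales is added.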

Self-Assessment (2 questions)

Q1. Which join keeps all rows from the LEFT table?

left_join keeps all left rows, filling in NA for right-table columns where there's no match. This is the most commonly used join in practice.
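
One thing to watch for: "keeps all left rows" means at least once. If a key matches several rows on the right, the left row is repeated. A minimal sketch with hypothetical enrolment data:

```r
library(dplyr)

students <- tibble(id = 1:3, name = c('Aarav', 'Kavya', 'Rohan'))
# Hypothetical enrolments: Aarav takes two courses, Rohan none
enrolled <- tibble(id = c(1, 1, 2), course = c('R', 'SQL', 'Python'))

joined <- left_join(students, enrolled, by = 'id')
nrow(joined)                # 4: Aarav appears twice
sum(is.na(joined$course))   # 1: Rohan has no match
```

Comparing nrow() before and after a join is a cheap way to catch unexpected duplicate keys.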

Q2. What does anti_join() return?

anti_join(x, y) returns the rows of x that have no match in y, keeping only x's columns. It is perfect for finding 'what's missing': products never ordered, students not enrolled, records without a match.

Concept 4

across() — Apply to Multiple Columns

across() inside mutate() or summarise() applies a function to multiple columns at once, eliminating repetition.

library(dplyr)

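# Standardise ALL numeric columns
# (scale() returns a one-column matrix, so convert it back to a plain vector)
df |> mutate(across(where(is.numeric), \(x) as.numeric(scale(x))))

# Round specific columns to 2 decimal places
# (extra arguments go inside a lambda; passing them through ... is deprecated since dplyr 1.1.0)
df |> mutate(across(c(math, english, science), \(x) round(x, digits = 2)))

# Multiple functions at once
df |> summarise(across(
  c(math, english, science),
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    sd   = \(x) sd(x, na.rm = TRUE),
    max  = \(x) max(x, na.rm = TRUE)
  )
))
# Creates: math_mean, math_sd, math_max, english_mean, etc.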

# Custom anonymous function
df |> mutate(across(where(is.character), ~ toupper(trimws(.))))

# rename_with() — transform column names consistently
df |> rename_with(tolower)              # all to lowercase
df |> rename_with(~ gsub(' ','_',.))
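
The generated names such as math_mean follow a default column_function pattern. If you prefer a different pattern, across() accepts a .names argument with {.col} and {.fn} placeholders. A minimal sketch on a hypothetical two-column table:

```r
library(dplyr)

# Hypothetical marks with one missing value
df <- tibble(math = c(92, 88, NA), english = c(81, 79, 90))

stats <- df |> summarise(across(
  c(math, english),
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    max  = \(x) max(x, na.rm = TRUE)
  ),
  .names = '{.fn}_{.col}'   # function name first, then column name
))
names(stats)   # "mean_math" "max_math" "mean_english" "max_english"
```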

Solved Examples

Example 1 Apply percent_rank() to all numeric columns simultaneously.

library(dplyr)
df |> mutate(across(where(is.numeric), percent_rank))
# percent_rank() gives 0-1 rank for each value
# Applied to ALL numeric columns in one line

Self-Assessment (1 question)

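Q1. What does across(where(is.numeric), mean) do inside summarise()?

across() with where(is.numeric) selects every column for which is.numeric() returns TRUE and applies mean() to each. Inside summarise() the result is one row (or one row per group) holding the mean of each numeric column; other columns are dropped.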
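
A minimal sketch of this on a hypothetical marks table with one non-numeric column:

```r
library(dplyr)

# Hypothetical marks table
df <- tibble(
  name    = c('Aarav', 'Kavya', 'Rohan'),
  math    = c(92, 88, 75),
  english = c(81, 79, 90)
)

means <- df |> summarise(across(where(is.numeric), mean))
names(means)   # "math" "english" (name is not numeric, so it is left out)
means$math     # 85
```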
