Text Mining & NLP with R

Advanced · 10 hrs · 2 Concepts

Your Learning Map

📌 You already know

You can manipulate vectors and tables.

🎯 You'll learn here

Turning text into data — tokenisation, stop words, TF-IDF and sentiment.

🌍 Where it's used

Reviews, tweets, support tickets and survey free-text become analysable numbers.

🔗 Unlocks next

Feeds models from the ML chapters; results are often served in a Shiny app.

tidytext Approach

Concept 1

Tokenisation and Stop Words

unnest_tokens() splits text into a tidy one-word-per-row format, lower-casing each word and stripping punctuation by default. anti_join(stop_words, by='word') then removes common words such as 'the', 'a' and 'is'.

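library(dplyr)     # count(), anti_join()
library(tidytext)  # unnest_tokens(), stop_words
# `reviews` is assumed to be a data frame with a character column `text`
tokens <- reviews |>
  unnest_tokens(word, text) |>          # one lower-cased word per row
  anti_join(stop_words, by = 'word')    # drop stop words
tokens |> count(word, sort = TRUE) |> head(20)  # 20 most frequent words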
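To try the pipeline without your own data, here is a minimal self-contained sketch. The `reviews` data frame and its three sentences are invented for illustration. The sketch assumes the dplyr and tidytext packages are installed and R is version 4.1 or later (for the `|>` pipe).

```r
library(dplyr)
library(tidytext)

# Hypothetical review data, invented for this sketch
reviews <- data.frame(
  doc_id = 1:3,
  text   = c("The battery lasts all day.",
             "The screen cracked after a week!",
             "Battery and screen are both excellent.")
)

tokens <- reviews |>
  unnest_tokens(word, text) |>          # one lower-cased word per row, punctuation dropped
  anti_join(stop_words, by = "word")    # keep only rows whose word is NOT in stop_words

# Word frequencies across all reviews, most frequent first
tokens |> count(word, sort = TRUE)
```

Because unnest_tokens() lower-cases by default, "Battery" and "battery" are counted as the same word, and "The" is matched and removed by the stop-word list.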

R: Count words

text <- "the cat sat on the mat the cat ran"
words <- strsplit(tolower(text), " ")[[1]]
sort(table(words), decreasing = TRUE)

Output (verified)

words
the cat mat  on ran sat 
  3   2   1   1   1   1
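In the base-R count, 'the' comes out on top even though it says nothing about the text. A minimal sketch of stop-word filtering in base R follows. The four-word `stop_list` is a hand-written assumption for this example, not the tidytext stop_words table.

```r
text  <- "the cat sat on the mat the cat ran"
words <- strsplit(tolower(text), " ")[[1]]

# Tiny hand-written stop list (an assumption for this sketch)
stop_list <- c("the", "on", "a", "is")

# Keep only the words that are not in the stop list
content_words <- words[!words %in% stop_list]

freq <- sort(table(content_words), decreasing = TRUE)
freq
```

With the stop words gone, 'cat' (2 occurrences) leads, followed by 'mat', 'ran' and 'sat' with one each.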

Solved Examples

Example 1 Apply the concept of Tokenisation and Stop Words to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().

Self-Assessment (2 questions)

Q1. Tokenisation means:

Tokenisation breaks a document into tokens (usually words) for analysis.

Q2. Stop words are:

Stop words carry little meaning and are usually filtered out before analysis.

Concept 2

TF-IDF and Sentiment

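TF-IDF (term frequency–inverse document frequency) scores a word highly when it is frequent in one document but rare across the rest of the collection, so it picks out the words that distinguish each document. Sentiment analysis assigns polarity scores to text, for example by summing per-word scores from a lexicon such as AFINN.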

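library(dplyr)
library(tidytext)
# `tokens` is assumed to have one row per word, with columns doc_id and word
# TF-IDF
tokens |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n) |>
  arrange(desc(tf_idf))
# Sentiment (AFINN scores: -5 to +5); get_sentiments('afinn') needs the textdata package
tokens |>
  inner_join(get_sentiments('afinn'), by = 'word') |>
  group_by(doc_id) |> summarise(sentiment = sum(value))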
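To see what the TF-IDF score measures, it can help to compute it by hand. The sketch below uses base R only and three invented documents. It takes tf as a word's share of the tokens in a document and idf as the natural log of (number of documents / number of documents containing the word), which is the definition bind_tf_idf() documents.

```r
# Three tiny documents, invented for this sketch
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog sat on the log",
  d3 = "the cat chased the dog"
)

# Tokenise by splitting on spaces (the text is already lower-case)
tokens <- strsplit(docs, " ")

# Document-term count matrix: rows are documents, columns are words
doc_id <- rep(names(tokens), lengths(tokens))
word   <- unlist(tokens, use.names = FALSE)
m <- unclass(table(doc_id, word))

tf     <- m / rowSums(m)                 # share of each document's tokens
idf    <- log(nrow(m) / colSums(m > 0))  # ln(documents / documents containing the word)
tf_idf <- sweep(tf, 2, idf, `*`)         # multiply each word's column by its idf

round(tf_idf["d1", c("the", "cat", "mat")], 3)
```

'the' appears in every document, so its idf is ln(3/3) = 0 and its score is 0 everywhere. 'mat' appears only in d1, so it scores (1/6) × ln(3) ≈ 0.183 there, the highest of the three words shown.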

Solved Examples

Example 1 Apply the concept of TF-IDF and Sentiment to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
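As one possible approach to this exercise, here is a minimal base-R sketch of lexicon-based sentiment scoring. Both the `tokens` table and the `lexicon` are invented for illustration. The scores are hand-written and are not taken from AFINN, though they follow the same idea of one signed number per word.

```r
# Hypothetical tokenised reviews: one row per word
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2),
  word   = c("great", "battery", "love", "terrible", "screen", "bad")
)

# Hand-written toy lexicon (illustrative scores, not real AFINN values)
lexicon <- data.frame(
  word  = c("great", "love", "bad", "terrible"),
  value = c(3, 3, -3, -3)
)

# merge() keeps only words found in both tables, like inner_join()
scored <- merge(tokens, lexicon, by = "word")

# Sum the word scores within each document
sentiment <- aggregate(value ~ doc_id, data = scored, FUN = sum)
sentiment
```

Document 1 sums to 6 and document 2 to -6. Words missing from the lexicon ('battery', 'screen') drop out in the join and contribute nothing, which is why lexicon coverage matters in practice.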

Self-Assessment (2 questions)

Q1. TF-IDF gives a HIGH weight to a word that is:

TF-IDF up-weights terms distinctive to a document and down-weights ubiquitous ones.

Q2. Sentiment analysis aims to determine:

Sentiment analysis scores text as positive, negative or neutral.
