Text Mining & NLP with R

Advanced | 10 hrs | 2 Concepts
M1

tidytext Approach

Concept 1

Tokenisation and Stop Words

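unnest_tokens() splits text into a tidy one-token-per-row format. By default each token is a single word, converted to lower case with punctuation stripped. anti_join(stop_words) then drops common words such as 'the', 'a', and 'is', which appear in almost every document and carry little meaning on their own.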

R
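library(dplyr)     # anti_join(), count()
library(tidytext)  # unnest_tokens(), stop_words
# reviews: a data frame with one review per row in a character column `text`
tokens <- reviews |>
  unnest_tokens(word, text) |>          # one lower-case word per row
  anti_join(stop_words, by = 'word')    # drop common English stop words
# The 20 most frequent remaining words
tokens |> count(word, sort = TRUE) |> head(20)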
Solved Examples
Example 1. Apply tokenisation and stop-word removal to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
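One way to work through Example 1 is sketched below. The `reviews` data frame is a hypothetical two-row toy dataset, not the course data. The first approach is the tidytext pipeline from this concept. The second is a rough base R equivalent that splits on anything other than letters and apostrophes, which is only an approximation of how unnest_tokens() handles punctuation.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: two short reviews (an assumption for illustration)
reviews <- data.frame(
  doc_id = c(1, 2),
  text   = c("The battery is great and the screen is sharp.",
             "A terrible battery, but the camera is great!")
)

# Approach 1: tidytext pipeline (one word per row, then drop stop words)
tidy_tokens <- reviews |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word")

# Approach 2: base R split, filtered against the same stop-word list
words <- unlist(strsplit(tolower(reviews$text), "[^a-z']+"))
words <- words[words != "" & !(words %in% stop_words$word)]

# Word frequencies from each approach
tidy_tokens |> count(word, sort = TRUE)
sort(table(words), decreasing = TRUE)

# Inspect the result, as the exercise suggests
str(tidy_tokens)
head(tidy_tokens)
```

The tidy version keeps `doc_id` next to every word, which later steps such as TF-IDF rely on. The base R version flattens everything into one character vector, so it only suits corpus-wide counts.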

Self-Assessment (2 questions)
Q1. What is the primary purpose of tokenisation and stop-word removal?
Q2. Which R package is most relevant for this topic?
Concept 2

TF-IDF and Sentiment

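TF-IDF (term frequency–inverse document frequency) weights a word by how often it appears in a document, discounted by how many documents in the collection contain it. Words that are frequent in one document but rare elsewhere score highest, so TF-IDF highlights the terms that characterise each document. Sentiment analysis assigns a polarity score to each word using a lexicon, then aggregates the scores per document.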
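To make the weighting concrete, the sketch below computes TF-IDF by hand and compares it with bind_tf_idf(). It uses the definitions tidytext applies: tf is a word's count divided by the total count for its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The `counts` table is hypothetical toy data.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for three tiny documents (an assumption for illustration)
counts <- data.frame(
  doc_id = c("a", "a", "b", "b", "c"),
  word   = c("battery", "screen", "battery", "camera", "battery"),
  n      = c(2, 1, 1, 3, 1)
)

n_docs <- n_distinct(counts$doc_id)

# By hand: tf within each document, idf across documents
by_hand <- counts |>
  group_by(doc_id) |>
  mutate(tf = n / sum(n)) |>                          # share of the document's words
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc_id))) |>   # natural log
  ungroup() |>
  mutate(tf_idf = tf * idf)

# Same calculation via tidytext
with_pkg <- counts |> bind_tf_idf(word, doc_id, n)

by_hand
with_pkg
```

Because "battery" occurs in every document, its idf is log(1) = 0 and its TF-IDF is zero everywhere, however often it appears. "camera" and "screen" each occur in only one document, so they receive positive weights.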

R
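# TF-IDF: assumes `tokens` still has a `doc_id` column identifying each document
tokens |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n) |>
  arrange(desc(tf_idf))
# Sentiment (AFINN scores: -5 to +5); get_sentiments('afinn') needs the textdata package
tokens |>
  inner_join(get_sentiments('afinn'), by = 'word') |>
  group_by(doc_id) |>
  summarise(sentiment = sum(value))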
Solved Examples
Example 1. Apply TF-IDF and sentiment scoring to a sample dataset. Show at least two approaches.

# See the code example above and adapt it to your data.
# Always check your output with str() and head().
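A possible worked version of the sentiment part of Example 1 follows. Both the `reviews` data and the `toy_afinn` lexicon are hypothetical. The toy lexicon only mimics the shape of AFINN (columns `word` and `value`) so the sketch runs without downloading anything, and its scores are illustrative. The second approach uses the Bing lexicon, which labels words as positive or negative and is bundled with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy reviews (an assumption for illustration)
reviews <- data.frame(
  doc_id = c(1, 2),
  text   = c("Great battery and a sharp, lovely screen.",
             "Terrible battery. Awful camera, but lovely case.")
)
tokens <- reviews |> unnest_tokens(word, text)

# Hypothetical mini-lexicon in AFINN's shape; values are illustrative only
toy_afinn <- data.frame(
  word  = c("great", "lovely", "terrible", "awful"),
  value = c(3, 3, -3, -3)
)

# Approach 1: sum numeric word scores per document
score_sum <- tokens |>
  inner_join(toy_afinn, by = "word") |>
  group_by(doc_id) |>
  summarise(sentiment = sum(value))

# Approach 2: positive minus negative word counts using the Bing lexicon
bing_net <- tokens |>
  inner_join(get_sentiments("bing"), by = "word") |>
  group_by(doc_id) |>
  summarise(net = sum(sentiment == "positive") - sum(sentiment == "negative"))

str(score_sum)
head(bing_net)
```

Sentiment words can themselves appear in a stop-word list, so this sketch scores the tokens before any stop-word removal. Summed scores reward strongly worded text, whereas the count-based approach treats every matched word equally, so the two can rank documents differently.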

Self-Assessment (2 questions)
Q1. What is the primary purpose of TF-IDF and sentiment analysis?
Q2. Which R package is most relevant for this topic?