Text Mining & NLP with R
tidytext Approach
Tokenisation and Stop Words
unnest_tokens() splits text into a tidy one-token-per-row format, lowercasing words and stripping punctuation by default. anti_join(stop_words) then removes common function words such as 'the', 'a', and 'is'.
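library(dplyr)      # count(), anti_join()
library(tidytext)   # unnest_tokens(), stop_words
# reviews: a data frame with one row per document and a character column `text`
tokens <- reviews |>
  unnest_tokens(word, text) |>           # one lowercase word per row
  anti_join(stop_words, by = 'word')     # drop common English stop words
# 20 most frequent remaining words
tokens |> count(word, sort = TRUE) |> head(20)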
# See the code example above and adapt it to your data.
# Always check your output with str() and head().
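To try the steps above without real data, here is a minimal sketch on a toy table. The name reviews_demo and its two sentences are made up for illustration; only unnest_tokens(), stop_words, and the dplyr verbs come from the packages. Note that stop_words combines three lexicons (SMART, snowball, onix) identified by its lexicon column, so filtering to one of them is a way to remove fewer words.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for `reviews`
reviews_demo <- tibble(
  doc_id = c(1, 2),
  text   = c("The battery is great.", "The screen cracked after a week!")
)

# Default behaviour: lowercase, strip punctuation, then drop every stop word
tokens_demo <- reviews_demo |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word")

# A lighter variant: use only the snowball lexicon from stop_words
snowball_only <- stop_words |> filter(lexicon == "snowball")
tokens_light <- reviews_demo |>
  unnest_tokens(word, text) |>
  anti_join(snowball_only, by = "word")

tokens_demo
tokens_light
```

Comparing the two results shows how much the choice of stop-word list changes what survives, which is worth checking before counting words.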
TF-IDF and Sentiment
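TF-IDF (term frequency–inverse document frequency) gives a high weight to words that are frequent in one document but rare across the rest of the collection, so it highlights the words that distinguish each document. Lexicon-based sentiment analysis looks each word up in a scored word list and sums the polarity scores per document.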
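To make the weighting concrete, the sketch below computes TF-IDF by hand on made-up counts and compares it with bind_tf_idf(). It assumes tf is a word's count divided by the document's total count and idf is the natural log of (number of documents / number of documents containing the word), which is how tidytext defines them. The table counts_demo is hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two documents (one row per document-word pair)
counts_demo <- tibble(
  doc_id = c(1, 1, 2, 2),
  word   = c("battery", "phone", "screen", "phone"),
  n      = c(2, 2, 1, 3)
)

n_docs <- n_distinct(counts_demo$doc_id)

by_hand <- counts_demo |>
  group_by(doc_id) |>
  mutate(tf = n / sum(n)) |>                        # share of the document's words
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc_id))) |> # rarer across documents = larger
  ungroup() |>
  mutate(tf_idf = tf * idf)

with_tidytext <- counts_demo |>
  bind_tf_idf(word, doc_id, n)

# "phone" occurs in both documents, so its idf (and tf_idf) is 0
by_hand
all.equal(by_hand$tf_idf, with_tidytext$tf_idf)
```

A word that appears in every document always scores zero, however frequent it is, which is why TF-IDF often makes a separate stop-word step less critical.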
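# TF-IDF (assumes reviews has a doc_id column identifying each document)
tokens |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n) |>
  arrange(desc(tf_idf))
# Sentiment (AFINN scores: -5 to +5)
# get_sentiments('afinn') needs the textdata package and downloads the lexicon on first use
tokens |>
  inner_join(get_sentiments('afinn'), by = 'word') |>
  group_by(doc_id) |>
  summarise(sentiment = sum(value))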
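The sentiment join can be tried offline with a stand-in lexicon. In this sketch toy_lexicon is a made-up word list with illustrative scores, not the real AFINN values, and tokens_demo is hypothetical. It also shows one detail of the inner_join() approach: documents with no lexicon words disappear from the result unless they are joined back in.

```r
library(dplyr)

# Hypothetical tokens (already one word per row, stop words removed)
tokens_demo <- tibble(
  doc_id = c(1, 1, 1, 2, 2, 3),
  word   = c("great", "battery", "love", "terrible", "screen", "phone")
)

# Made-up AFINN-style lexicon; scores are illustrative only
toy_lexicon <- tibble(
  word  = c("great", "love", "terrible"),
  value = c(3, 3, -3)
)

# Keep only scored words, then sum the scores per document
scores <- tokens_demo |>
  inner_join(toy_lexicon, by = "word") |>
  group_by(doc_id) |>
  summarise(sentiment = sum(value), n_scored = n())

# Document 3 has no lexicon words, so it is missing from `scores`;
# a left join from the full document list restores it with a score of 0
all_docs <- tokens_demo |>
  distinct(doc_id) |>
  left_join(scores, by = "doc_id") |>
  mutate(
    sentiment = coalesce(sentiment, 0),
    n_scored  = coalesce(n_scored, 0L)
  )

all_docs
```

Keeping n_scored alongside the sum makes it easier to tell a truly neutral document from one where the lexicon simply matched nothing.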