There are quite a number of R packages for doing text mining. For example,
See here for a nice comparison between the packages: https://quanteda.io/articles/pkgdown/comparison.html
We will focus on tidytext
for its simplicity.
So what do we do with text mining?
(from https://www.linguamatics.com/) > Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.
(Disclaimer: I am not an expert of NLP or text mining)
Text data often are stored in the following ways
tf-idf
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
# A poem by Emily Dickinson
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
df <- tibble(line = 1:4, text = text)
We want to convert the data frame so that it has one-token-per-document-per-row.
df %>%
unnest_tokens(word, text)
## # A tibble: 20 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 for
## 12 2 me
## 13 3 the
## 14 3 carriage
## 15 3 held
## 16 3 but
## 17 3 just
## 18 3 ourselves
## 19 4 and
## 20 4 immortality
After using unnest_tokens
, we’ve split each row so that there is one token (word) in each row of the new data frame; the default tokenization in unnest_tokens()
is for single words (unigram), as shown here. Also notice:
unnest_tokens()
converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE
argument to turn off this behavior).Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English.
data(stop_words)
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
## # A tibble: 7 x 2
## line word
## <int> <chr>
## 1 1 stop
## 2 1 death
## 3 2 kindly
## 4 2 stopped
## 5 3 carriage
## 6 3 held
## 7 4 immortality
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
Try to read some news headlines from abc news
headlines <- read_html("https://abcnews.go.com/") %>%
html_nodes("div.headlines-li-div")
news_df <- tibble(
title = headlines %>% html_text() %>% str_trim(),
url = headlines %>% html_node("a") %>% html_attr("href")
) %>%
filter(str_detect(url, fixed("https://abcnews.go.com"))) %>%
distinct(url, .keep_all = TRUE) %>%
mutate(id = sprintf("%02d", row_number()))
Read the contents
news_content <- NULL
for (i in seq_len(nrow(news_df))) {
news_content <- bind_rows(
news_content,
read_html(news_df$url[i]) %>%
html_nodes("div.Article p") %>%
html_text() %>% {
tibble(id = news_df$id[i], text = c(news_df$title[i], .))
}
)
}
news_tokens <- news_content %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
group_by(id) %>%
count(word, sort = TRUE) %>%
arrange(id)
## Joining, by = "word"
Use relative frequencies
(news_propotions <- news_tokens %>%
group_by(id) %>%
mutate(propotion = n / sum(n)) %>%
select(-n) %>%
arrange(id, desc(propotion)))
## # A tibble: 7,642 x 3
## # Groups: id [32]
## id word propotion
## <chr> <chr> <dbl>
## 1 01 call 0.0379
## 2 01 trump 0.0337
## 3 01 watson 0.0253
## 4 01 recording 0.0211
## 5 01 news 0.0147
## 6 01 abc 0.0126
## 7 01 georgia 0.0126
## 8 01 audio 0.0105
## 9 01 election 0.0105
## 10 01 investigator 0.0105
## # … with 7,632 more rows
library(wordcloud)
## Loading required package: RColorBrewer
news_tokens %>%
filter(id == "01") %>%
with(wordcloud(
word, n, min.freq = 2, max.words = 100, random.order = FALSE,
colors = brewer.pal(8, "Dark2")))
news_tokens %>%
slice_max(n, n = 5, with_ties = FALSE) %>%
ggplot(aes(x = fct_reorder(word, n), y = n, fill = id)) +
geom_bar(stat = "identity", na.rm = TRUE, show.legend = FALSE) +
xlab("word") +
facet_wrap(~id, scales = "free") +
coord_flip()
The order the categories in each facet are not ordered correctly. Following https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets, I got the following fix.
news_tokens %>%
slice_max(n, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(id, n) %>%
mutate(order = row_number()) %>% {
ggplot(., aes(x = order, y = n, fill = id)) +
geom_bar(stat = "identity", na.rm = TRUE, show.legend = FALSE) +
xlab("word") +
facet_wrap(~id, scales = "free") +
coord_flip() +
scale_x_continuous(breaks = .$order, labels = .$word, expand = c(0, 0))
}
The tidytext
package contains several sentiment lexicons. Three general-purpose lexicons are
AFINN
from Finn Årup Nielsen,bing
from Bing Liu and collaborators, andnrc
from Saif Mohammad and Peter Turney.All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only.
library(textdata)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
news_tokens %>%
left_join(get_sentiments("bing")) %>%
group_by(id) %>%
summarize( # the frequencies are ignored in this analysis
positive = sum(sentiment == "positive", na.rm = TRUE),
negative = sum(sentiment == "negative", na.rm = TRUE),
netural = n() - positive - negative) %>%
mutate(
id,
sentiment = case_when(
positive > negative ~ "positive",
positive < negative ~ "negative",
TRUE ~ "netural"
)
) %>%
left_join(select(news_df, id, title)) %>%
mutate(title = str_trunc(title, 80)) %>%
select(sentiment, title)
## Joining, by = "word"
## Joining, by = "id"
## # A tibble: 32 x 2
## sentiment title
## <chr> <chr>
## 1 negative WSJ obtains audio of Trump, Ga. investigator call
## 2 negative 5 officers charged for killing robbery suspect
## 3 negative State approves near-total abortion ban
## 4 positive Merrick Garland confirmed as attorney general
## 5 negative Mexico lawmakers advance bill to legalize cannabis
## 6 negative Suspect nabbed after Asian American man attacked
## 7 netural What's inside the COVID-19 bill passed by Congress
## 8 negative FBI must tackle extremists in military: Raskin
## 9 negative Ghislaine Maxwell's 3rd quest for bail opposed
## 10 negative 2 more jurors chosen for Derek Chauvin trial
## # … with 22 more rows
A central question in text mining and natural language processing is how to quantify what a document is about. One measure of how important a word may be is its term frequency. Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much.
\[ \text{idf} = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}} \right) \]
\[ \text{tf_idf} = \text{tf} \times \text{idf} \]
Show the 3 most important words in each document.
news_tokens %>%
bind_tf_idf(word, id, n) %>%
slice_max(tf_idf, n = 3) %>%
select(id, word, n, tf_idf)
## # A tibble: 101 x 4
## # Groups: id [32]
## id word n tf_idf
## <chr> <chr> <int> <dbl>
## 1 01 watson 12 0.0876
## 2 01 call 18 0.0788
## 3 01 recording 10 0.0730
## 4 02 rodriguez 12 0.113
## 5 02 officers 17 0.0771
## 6 02 oklahoma 10 0.0641
## 7 03 abortion 11 0.133
## 8 03 abortions 4 0.0483
## 9 03 guttmacher 4 0.0483
## 10 03 hutchinson 4 0.0483
## # … with 91 more rows
Using term frequency and inverse document frequency allows us to find words that are characteristic for one document within a collection of documents.
We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams.
Our sentiment analysis approach above simply counted the appearance of positive or negative words. One of the problems with this approach is that a word’s context can matter nearly as much as its presence. For example, the words “happy” and “like” will be counted as positive, even in a sentence like “I’m not happy and I don’t like it!”
# stop words are not removed in this case
news_tokens2 <- news_content %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")
Back to the sentiment analysis
negate_words <- c("not", "without", "no", "can't", "don't", "won't")
news_tokens2 %>%
group_by(id) %>%
count(word1, word2) %>%
left_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
mutate(sentiment = case_when(
word1 %in% negate_words & sentiment == "negative" ~ "positive",
word1 %in% negate_words & sentiment == "positive" ~ "negative",
TRUE ~ sentiment)) %>%
summarize(
positive = sum(sentiment == "positive", na.rm = TRUE),
negative = sum(sentiment == "negative", na.rm = TRUE),
netural = n() - positive - negative) %>%
mutate(
id,
sentiment = case_when(
positive > negative ~ "positive",
positive < negative ~ "negative",
TRUE ~ "netural"
)
) %>%
left_join(select(news_df, id, title)) %>%
mutate(title = str_trunc(title, 80)) %>%
select(sentiment, title)
## Joining, by = "id"
## # A tibble: 32 x 2
## sentiment title
## <chr> <chr>
## 1 positive WSJ obtains audio of Trump, Ga. investigator call
## 2 negative 5 officers charged for killing robbery suspect
## 3 negative State approves near-total abortion ban
## 4 positive Merrick Garland confirmed as attorney general
## 5 negative Mexico lawmakers advance bill to legalize cannabis
## 6 negative Suspect nabbed after Asian American man attacked
## 7 positive What's inside the COVID-19 bill passed by Congress
## 8 negative FBI must tackle extremists in military: Raskin
## 9 negative Ghislaine Maxwell's 3rd quest for bail opposed
## 10 negative 2 more jurors chosen for Derek Chauvin trial
## # … with 22 more rows
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
First, we will need to define similarity. One way to quantity how similar two documents are is the cosine distance.
Compare to regular euclidean distance, it is more situable for sparse vectors.
library(proxy)
##
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
docsdissim <- dist(as.matrix(cast_dtm(news_tokens, id, word, n)), method = "cosine")
h <- hclust(docsdissim, method = "ward.D2")
plot(h)
We could use cutree
to pick the number of groups.
tibble(topic = cutree(h, k = 3), title = news_df$title) %>%
arrange(topic) %>%
mutate(title = str_trunc(title, 80)) %>%
select(topic, title)
## # A tibble: 32 x 2
## topic title
## <int> <chr>
## 1 1 WSJ obtains audio of Trump, Ga. investigator call
## 2 1 5 officers charged for killing robbery suspect
## 3 1 State approves near-total abortion ban
## 4 1 Merrick Garland confirmed as attorney general
## 5 1 Mexico lawmakers advance bill to legalize cannabis
## 6 1 Suspect nabbed after Asian American man attacked
## 7 1 FBI must tackle extremists in military: Raskin
## 8 1 Ghislaine Maxwell's 3rd quest for bail opposed
## 9 1 2 more jurors chosen for Derek Chauvin trial
## 10 1 Paraplegic inmate denied shower for 5 months: Suit
## # … with 22 more rows
In cluster analysis, we only assign a single topic to a single document. In reality, a single document may contain multiple topics.
Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.
library(topicmodels)
We want to allocate the topics and documents into 4 different topics.
news_lda <- LDA(
cast_dtm(news_tokens, id, word, n),
k = 4,
control = list(seed = 1234))
The following table shows the probability of a specific term occurring in a topic.
tidy(news_lda, matrix = "beta") %>%
group_by(topic) %>%
top_n(10, beta) %>%
arrange(topic, desc(beta))
## # A tibble: 43 x 3
## # Groups: topic [4]
## topic term beta
## <int> <chr> <dbl>
## 1 1 health 0.0104
## 2 1 students 0.00898
## 3 1 water 0.00826
## 4 1 news 0.00575
## 5 1 utah 0.00539
## 6 1 prices 0.00503
## 7 1 colorado 0.00503
## 8 1 marijuana 0.00467
## 9 1 schools 0.00467
## 10 1 mental 0.00467
## # … with 33 more rows
The following table shows the probability of a specific topic occurring in a document.
# t is the variable `alpha` on the wikipedia page of LDA
tidy(news_lda, matrix = "gamma") %>%
pivot_wider(names_from = topic, values_from = gamma)
## # A tibble: 32 x 5
## document `1` `2` `3` `4`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 0.0000489 0.0000489 1.00 0.0000489
## 2 02 0.0000629 1.00 0.0000629 0.0000629
## 3 03 0.0000809 1.00 0.0000809 0.0000809
## 4 04 0.0000801 0.651 0.349 0.0000801
## 5 05 1.00 0.0000921 0.0000921 0.0000921
## 6 06 0.000195 0.999 0.000195 0.000195
## 7 07 0.0000405 0.0000405 1.00 0.0000405
## 8 08 0.0000505 0.0000505 1.00 0.0000505
## 9 09 0.0000918 0.0000918 1.00 0.0000918
## 10 10 0.0000358 1.00 0.0000358 0.0000358
## # … with 22 more rows
tidy(news_lda, "gamma") %>%
mutate(topic = as_factor(topic)) %>%
ggplot(aes(topic, gamma, fill = topic)) +
geom_bar(stat = "identity") +
facet_wrap(~document, 4) +
coord_flip()
Correlated topic model is an extension of LDA which supports correlations between topics.
news_ctm <- CTM(
cast_dtm(news_tokens, id, word, n),
k = 4,
control = list(seed = 1234) )
tidy(news_ctm, matrix = "beta") %>%
group_by(topic) %>%
top_n(10, beta) %>%
arrange(topic, desc(beta))
## # A tibble: 40 x 3
## # Groups: topic [4]
## topic term beta
## <int> <chr> <dbl>
## 1 1 water 0.0102
## 2 1 bill 0.0102
## 3 1 call 0.00801
## 4 1 trump 0.00713
## 5 1 billion 0.00668
## 6 1 utah 0.00668
## 7 1 news 0.00621
## 8 1 legislation 0.00579
## 9 1 river 0.00579
## 10 1 colorado 0.00534
## # … with 30 more rows
tidy(news_ctm, "gamma") %>%
mutate(topic = as_factor(topic)) %>%
ggplot(aes(topic, gamma, fill = topic)) +
geom_bar(stat = "identity") +
facet_wrap(~document, 4) +
coord_flip()