Text Mining

There are quite a number of R packages for doing text mining. For example,

tm
quanteda
tidytext
corpus
koRpus

See here for a nice comparison between the packages: https://quanteda.io/articles/pkgdown/comparison.html

We will focus on tidytext for its simplicity.

So what do we do with text mining?

(from https://www.linguamatics.com/) > Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.

(Disclaimer: I am not an expert of NLP or text mining)

Text data often are stored in the following ways

String: Text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
Corpus: These types of objects typically contain raw strings annotated with additional metadata and details.
Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term. The value in the matrix is typically word count or tf-idf

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidytext)

# A poem by Emily Dickinson 
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")
df <- tibble(line = 1:4, text = text)

We want to convert the data frame so that it has one-token-per-document-per-row.

df %>% 
  unnest_tokens(word, text)

## # A tibble: 20 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality

After using unnest_tokens, we’ve split each row so that there is one token (word) in each row of the new data frame; the default tokenization in unnest_tokens() is for single words (unigram), as shown here. Also notice:

Other columns, such as the line number each word came from, are retained.
Punctuation has been stripped.
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).

Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English.

data(stop_words)
stop_words

## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows

df %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words)

## Joining, by = "word"

## # A tibble: 7 x 2
##    line word       
##   <int> <chr>      
## 1     1 stop       
## 2     1 death      
## 3     2 kindly     
## 4     2 stopped    
## 5     3 carriage   
## 6     3 held       
## 7     4 immortality

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
## 
##     pluck

## The following object is masked from 'package:readr':
## 
##     guess_encoding

Try to read some news headlines from abc news

headlines <- read_html("https://abcnews.go.com/") %>% 
  html_nodes("div.headlines-li-div")

news_df <- tibble(
    title = headlines %>% html_text() %>% str_trim(),
    url = headlines %>%  html_node("a") %>% html_attr("href")
  ) %>% 
  filter(str_detect(url, fixed("https://abcnews.go.com"))) %>% 
  distinct(url, .keep_all = TRUE) %>% 
  mutate(id = sprintf("%02d", row_number()))

Read the contents

news_content <- NULL
for (i in seq_len(nrow(news_df))) {
  news_content <- bind_rows(
    news_content, 
    read_html(news_df$url[i]) %>% 
      html_nodes("div.Article p") %>% 
      html_text() %>% {
        tibble(id = news_df$id[i], text = c(news_df$title[i], .))
      }
  )
}

news_tokens <- news_content %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>% 
  group_by(id) %>%
  count(word, sort = TRUE) %>% 
  arrange(id)

## Joining, by = "word"

Use relative frequencies

(news_propotions <- news_tokens %>% 
  group_by(id) %>% 
  mutate(propotion = n / sum(n)) %>% 
  select(-n) %>%
  arrange(id, desc(propotion)))

## # A tibble: 7,642 x 3
## # Groups:   id [32]
##    id    word         propotion
##    <chr> <chr>            <dbl>
##  1 01    call            0.0379
##  2 01    trump           0.0337
##  3 01    watson          0.0253
##  4 01    recording       0.0211
##  5 01    news            0.0147
##  6 01    abc             0.0126
##  7 01    georgia         0.0126
##  8 01    audio           0.0105
##  9 01    election        0.0105
## 10 01    investigator    0.0105
## # … with 7,632 more rows

library(wordcloud)

## Loading required package: RColorBrewer

news_tokens %>% 
  filter(id == "01") %>% 
  with(wordcloud(
    word, n, min.freq = 2, max.words = 100, random.order = FALSE,
    colors = brewer.pal(8, "Dark2")))

news_tokens %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = id)) + 
      geom_bar(stat = "identity", na.rm = TRUE, show.legend = FALSE) +
      xlab("word") +
      facet_wrap(~id, scales = "free") +
      coord_flip()

The order the categories in each facet are not ordered correctly. Following https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets, I got the following fix.

news_tokens %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(id, n) %>% 
  mutate(order = row_number()) %>% {
    ggplot(., aes(x = order, y = n, fill = id)) + 
      geom_bar(stat = "identity", na.rm = TRUE, show.legend = FALSE) +
      xlab("word") +
      facet_wrap(~id, scales = "free") +
      coord_flip() +
      scale_x_continuous(breaks = .$order, labels = .$word, expand = c(0, 0))
  }

Sentiment analysis

The tidytext package contains several sentiment lexicons. Three general-purpose lexicons are

AFINN from Finn Årup Nielsen,
bing from Bing Liu and collaborators, and
nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only.

library(textdata)

get_sentiments("afinn")

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

get_sentiments("bing")

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

get_sentiments("nrc")

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

news_tokens %>% 
  left_join(get_sentiments("bing")) %>% 
  group_by(id) %>% 
  summarize(  # the frequencies are ignored in this analysis
    positive = sum(sentiment == "positive", na.rm = TRUE), 
    negative = sum(sentiment == "negative", na.rm = TRUE), 
    netural = n() - positive - negative) %>%
  mutate(
    id,
    sentiment = case_when(
      positive > negative ~ "positive",
      positive < negative ~ "negative",
      TRUE ~ "netural"
    )
  ) %>% 
  left_join(select(news_df, id, title)) %>% 
  mutate(title = str_trunc(title, 80)) %>% 
  select(sentiment, title)

## Joining, by = "word"

## Joining, by = "id"

## # A tibble: 32 x 2
##    sentiment title                                             
##    <chr>     <chr>                                             
##  1 negative  WSJ obtains audio of Trump, Ga. investigator call 
##  2 negative  5 officers charged for killing robbery suspect    
##  3 negative  State approves near-total abortion ban            
##  4 positive  Merrick Garland confirmed as attorney general     
##  5 negative  Mexico lawmakers advance bill to legalize cannabis
##  6 negative  Suspect nabbed after Asian American man attacked  
##  7 netural   What's inside the COVID-19 bill passed by Congress
##  8 negative  FBI must tackle extremists in military: Raskin    
##  9 negative  Ghislaine Maxwell's 3rd quest for bail opposed    
## 10 negative  2 more jurors chosen for Derek Chauvin trial      
## # … with 22 more rows

Analyzing word and document frequency: tf-idf

A central question in text mining and natural language processing is how to quantify what a document is about. One measure of how important a word may be is its term frequency. Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much.

\[ \text{idf} = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}} \right) \]

\[ \text{tf_idf} = \text{tf} \times \text{idf} \]

Show the 3 most important words in each document.

news_tokens %>% 
  bind_tf_idf(word, id, n) %>%
  slice_max(tf_idf, n = 3) %>%
  select(id, word, n, tf_idf)

## # A tibble: 101 x 4
## # Groups:   id [32]
##    id    word           n tf_idf
##    <chr> <chr>      <int>  <dbl>
##  1 01    watson        12 0.0876
##  2 01    call          18 0.0788
##  3 01    recording     10 0.0730
##  4 02    rodriguez     12 0.113 
##  5 02    officers      17 0.0771
##  6 02    oklahoma      10 0.0641
##  7 03    abortion      11 0.133 
##  8 03    abortions      4 0.0483
##  9 03    guttmacher     4 0.0483
## 10 03    hutchinson     4 0.0483
## # … with 91 more rows

Using term frequency and inverse document frequency allows us to find words that are characteristic for one document within a collection of documents.

n-grams

We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams.

Using bigrams to provide context in sentiment analysis

Our sentiment analysis approach above simply counted the appearance of positive or negative words. One of the problems with this approach is that a word’s context can matter nearly as much as its presence. For example, the words “happy” and “like” will be counted as positive, even in a sentence like “I’m not happy and I don’t like it!”

# stop words are not removed in this case
news_tokens2 <- news_content %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

Back to the sentiment analysis

negate_words <- c("not", "without", "no", "can't", "don't", "won't")

news_tokens2 %>% 
  group_by(id) %>% 
  count(word1, word2) %>% 
  left_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  mutate(sentiment = case_when(
    word1 %in% negate_words & sentiment == "negative" ~ "positive", 
    word1 %in% negate_words & sentiment == "positive" ~ "negative",
    TRUE ~ sentiment)) %>% 
  summarize(
    positive = sum(sentiment == "positive", na.rm = TRUE), 
    negative = sum(sentiment == "negative", na.rm = TRUE), 
    netural = n() - positive - negative) %>%
  mutate(
    id,
    sentiment = case_when(
      positive > negative ~ "positive",
      positive < negative ~ "negative",
      TRUE ~ "netural"
    )
  ) %>% 
  left_join(select(news_df, id, title)) %>% 
  mutate(title = str_trunc(title, 80)) %>% 
  select(sentiment, title)

## Joining, by = "id"

## # A tibble: 32 x 2
##    sentiment title                                             
##    <chr>     <chr>                                             
##  1 positive  WSJ obtains audio of Trump, Ga. investigator call 
##  2 negative  5 officers charged for killing robbery suspect    
##  3 negative  State approves near-total abortion ban            
##  4 positive  Merrick Garland confirmed as attorney general     
##  5 negative  Mexico lawmakers advance bill to legalize cannabis
##  6 negative  Suspect nabbed after Asian American man attacked  
##  7 positive  What's inside the COVID-19 bill passed by Congress
##  8 negative  FBI must tackle extremists in military: Raskin    
##  9 negative  Ghislaine Maxwell's 3rd quest for bail opposed    
## 10 negative  2 more jurors chosen for Derek Chauvin trial      
## # … with 22 more rows

Cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)

First, we will need to define similarity. One way to quantity how similar two documents are is the cosine distance.

Compare to regular euclidean distance, it is more situable for sparse vectors.

library(proxy)

## 
## Attaching package: 'proxy'

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

docsdissim <- dist(as.matrix(cast_dtm(news_tokens, id, word, n)), method = "cosine")

h <- hclust(docsdissim, method = "ward.D2")
plot(h)

We could use cutree to pick the number of groups.

tibble(topic = cutree(h, k = 3), title = news_df$title) %>% 
  arrange(topic) %>% 
  mutate(title = str_trunc(title, 80)) %>% 
  select(topic, title)

## # A tibble: 32 x 2
##    topic title                                             
##    <int> <chr>                                             
##  1     1 WSJ obtains audio of Trump, Ga. investigator call 
##  2     1 5 officers charged for killing robbery suspect    
##  3     1 State approves near-total abortion ban            
##  4     1 Merrick Garland confirmed as attorney general     
##  5     1 Mexico lawmakers advance bill to legalize cannabis
##  6     1 Suspect nabbed after Asian American man attacked  
##  7     1 FBI must tackle extremists in military: Raskin    
##  8     1 Ghislaine Maxwell's 3rd quest for bail opposed    
##  9     1 2 more jurors chosen for Derek Chauvin trial      
## 10     1 Paraplegic inmate denied shower for 5 months: Suit
## # … with 22 more rows

In cluster analysis, we only assign a single topic to a single document. In reality, a single document may contain multiple topics.

LDA

Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.

Every topic is a mixture of words
Every document is a mixture of topics

library(topicmodels)

We want to allocate the topics and documents into 4 different topics.

news_lda <- LDA(
  cast_dtm(news_tokens, id, word, n), 
  k = 4,
  control = list(seed = 1234))

Word-topic probabilities

The following table shows the probability of a specific term occurring in a topic.

tidy(news_lda, matrix = "beta") %>% 
  group_by(topic) %>% 
  top_n(10, beta) %>% 
  arrange(topic, desc(beta))

## # A tibble: 43 x 3
## # Groups:   topic [4]
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 health    0.0104 
##  2     1 students  0.00898
##  3     1 water     0.00826
##  4     1 news      0.00575
##  5     1 utah      0.00539
##  6     1 prices    0.00503
##  7     1 colorado  0.00503
##  8     1 marijuana 0.00467
##  9     1 schools   0.00467
## 10     1 mental    0.00467
## # … with 33 more rows

Document-topic probabilities

The following table shows the probability of a specific topic occurring in a document.

# t is the variable `alpha` on the wikipedia page of LDA
tidy(news_lda, matrix = "gamma") %>%
    pivot_wider(names_from = topic, values_from = gamma)

## # A tibble: 32 x 5
##    document       `1`       `2`       `3`       `4`
##    <chr>        <dbl>     <dbl>     <dbl>     <dbl>
##  1 01       0.0000489 0.0000489 1.00      0.0000489
##  2 02       0.0000629 1.00      0.0000629 0.0000629
##  3 03       0.0000809 1.00      0.0000809 0.0000809
##  4 04       0.0000801 0.651     0.349     0.0000801
##  5 05       1.00      0.0000921 0.0000921 0.0000921
##  6 06       0.000195  0.999     0.000195  0.000195 
##  7 07       0.0000405 0.0000405 1.00      0.0000405
##  8 08       0.0000505 0.0000505 1.00      0.0000505
##  9 09       0.0000918 0.0000918 1.00      0.0000918
## 10 10       0.0000358 1.00      0.0000358 0.0000358
## # … with 22 more rows

tidy(news_lda, "gamma") %>%
  mutate(topic = as_factor(topic)) %>% 
  ggplot(aes(topic, gamma, fill = topic)) + 
  geom_bar(stat = "identity") + 
  facet_wrap(~document, 4) + 
  coord_flip()

CTM

Correlated topic model is an extension of LDA which supports correlations between topics.

news_ctm <- CTM(
  cast_dtm(news_tokens, id, word, n), 
  k = 4, 
  control = list(seed = 1234) )

tidy(news_ctm, matrix = "beta") %>% 
  group_by(topic) %>% 
  top_n(10, beta) %>% 
  arrange(topic, desc(beta))

## # A tibble: 40 x 3
## # Groups:   topic [4]
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 water       0.0102 
##  2     1 bill        0.0102 
##  3     1 call        0.00801
##  4     1 trump       0.00713
##  5     1 billion     0.00668
##  6     1 utah        0.00668
##  7     1 news        0.00621
##  8     1 legislation 0.00579
##  9     1 river       0.00579
## 10     1 colorado    0.00534
## # … with 30 more rows

tidy(news_ctm, "gamma") %>%
  mutate(topic = as_factor(topic)) %>% 
  ggplot(aes(topic, gamma, fill = topic)) + 
  geom_bar(stat = "identity") + 
  facet_wrap(~document, 4) + 
  coord_flip()

Regex