class: ur-title, center, middle, title-slide

# BST430 Lecture 10

## Text Processing

### Andrew McDavid

### U of Rochester

### 2021-09-26 (updated: 2021-10-06)

---
## Cardiac diagnoses


```r
diagnoses = read_csv('l10/data/cardiac-dx.csv')
diagnoses
```

```
## # A tibble: 100 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d} | abdominal situs ambiguous (abdominal he…
##  2 10949 M      {s,d,l} | aortic stenosis - valvar | atrial septa…
##  3  8090 F      {s,l,l} | aortic valve position relative to the p…
##  4 19800 M      {s,l,l} | crisscross atrioventricular valves | de…
##  5  2708 M      aberrant left subclavian artery | hypoplastic mai…
##  6 14031 F      absence of the suprarenal inferior vena cava with…
##  7  4185 M      aortic arch hypoplasia | aortic atresia | coronar…
##  8   313 F      aortic arch hypoplasia | aortic atresia | hypopla…
## # … with 92 more rows
```

.question[How to get all the diagnoses of hypoplastic left heart?]

---
## Attempt 1


```r
diagnoses %>% count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 67 × 2
##   diagnoses                                                     n
##   <chr>                                                     <int>
## 1 screener diagnosis: none                                     26
## 2 hypoplastic left heart syndrome | screener diagnosis: no…     5
## 3 atrial septal defect, secundum | screener diagnosis: atr…     3
## 4 aortic atresia | hypoplastic left heart syndrome | mitra…     2
## 5 screener diagnosis: none | tetralogy of fallot                2
## 6 {a,d,d} | abdominal situs ambiguous (abdominal heterotax…     1
## 7 {s,d,l} | aortic stenosis - valvar | atrial septal defec…     1
## 8 {s,l,l} | aortic valve position relative to the pulmonar…     1
## # … with 59 more rows
```

--

Too many unique diagnoses to make headway with this.

---
## Using string matching


```r
filter(diagnoses, str_detect(diagnoses, 'hypoplastic'))
```

```
## # A tibble: 50 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d} | abdominal situs ambiguous (abdominal he…
##  2 10949 M      {s,d,l} | aortic stenosis - valvar | atrial septa…
##  3 19800 M      {s,l,l} | crisscross atrioventricular valves | de…
##  4  2708 M      aberrant left subclavian artery | hypoplastic mai…
##  5 14031 F      absence of the suprarenal inferior vena cava with…
##  6  4185 M      aortic arch hypoplasia | aortic atresia | coronar…
##  7   313 F      aortic arch hypoplasia | aortic atresia | hypopla…
##  8 10792 M      aortic arch hypoplasia | aortic atresia | hypopla…
## # … with 42 more rows
```

.question[What different sorts of hypoplasia are there?]

---
## Split into pieces


```r
diagnoses_row = diagnoses %>%
  tidyr::separate_rows(diagnoses,
*   sep = " \\| ") #WTH?
diagnoses_row
```

```
## # A tibble: 412 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d}                                           
##  2 26108 F      abdominal situs ambiguous (abdominal heterotaxy)  
##  3 26108 F      aortic atresia                                    
##  4 26108 F      double outlet right ventricle                     
##  5 26108 F      heterotaxy syndrome                               
##  6 26108 F      inferior vena cava, left sided                    
##  7 26108 F      screener diagnosis: other: cavc, aa, hypoplastic …
...
```

---
## Split into pieces


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 15 × 2
##   diagnoses                                                  n
##   <chr>                                                  <int>
## 1 hypoplastic left heart syndrome                           21
## 2 hypoplastic right ventricle (subnormal cavity volume)      8
## 3 hypoplastic left ventricle (subnormal cavity volume)       7
## 4 hypoplastic mitral valve                                   7
## 5 hypoplastic left pulmonary artery                          6
## 6 hypoplastic main pulmonary artery                          6
## 7 hypoplastic right pulmonary artery                         5
## 8 hypoplastic tricuspid valve                                3
## # … with 7 more rows
```

---
## Plot co-occurrence

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  ggplot(aes(y = diagnoses, x = as.factor(id))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Sensible factor orders

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Wrap text width

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-10-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Adjust text size and justification

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
* theme(axis.text.y = element_text(hjust = 0, vjust = 0, size = 8)) +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-11-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
# Topics

* Why text processing
* Low-level processing:
  * concatenate, count characters, substring, split strings.
* Regular expressions (aka regex):
  * detect, extract, replace
* Text mining
  * Tokenizing, filtering, analysis

---
## Low-level text processing

* concatenate with `stringr::str_c()` and `glue::glue()`
* count characters with `base::nchar()`
* extract and replace substrings with `stringr::str_sub()`
* split with `str_split_fixed()` (generally) or `str_split()` (less often)

---
## Packages

.pull-left[
`stringr` and `glue` rationalize much of text processing, which is otherwise a bit of a thicket in R.
]

---
## Concatenate strings

.pull-left[

```r
names = c("Jeff B.", "Larry E.", "Warren B.")
favorite_food = c("caviar", "cake", "Pappy Van Winkle")
str_c(names, " likes ", #note additional spaces
      favorite_food, ".")
```

```
## [1] "Jeff B. likes caviar."            
## [2] "Larry E. likes cake."             
## [3] "Warren B. likes Pappy Van Winkle."
```
]
.pull-right[

```r
dinner = glue::glue("{names} likes {favorite_food}.")
dinner
```

```
## Jeff B. likes caviar.
## Larry E. likes cake.
## Warren B. likes Pappy Van Winkle.
```
]

---
## Some special characters

* \n newline
* \r carriage return
* \t tab
* \f form feed
* \Unnnnnnnn Unicode character with given code
* \\\ literal backslash (.alert[this one will prove to be especially annoying...])
* \" literal quote

Others are listed in `?'"'` (the help page on quotes).

---
## Glue with newlines and unicode


```r
glue::glue("{names} \n {favorite_food} \U1F600.")
```

```
## Jeff B. 
##  caviar 😀.
## Larry E. 
##  cake 😀.
## Warren B. 
##  Pappy Van Winkle 😀.
```

---
## Count characters


```r
names
```

```
## [1] "Jeff B."   "Larry E."  "Warren B."
```

```r
nchar(names)
```

```
## [1] 7 8 9
```

---
## Extract substrings

.pull-left[
Extract

```r
str_sub(dinner, 1, 11)
```

```
## [1] "Jeff B. lik" "Larry E. li" "Warren B. l"
```
]
.pull-right[
Replace

```r
str_sub(dinner,
        nchar(names) + 2, # start of "likes" (after the space)
        nchar(names) + 6  # end of "likes"
) = "demands"
dinner
```

```
## [1] "Jeff B. demands caviar."            
## [2] "Larry E. demands cake."             
## [3] "Warren B. demands Pappy Van Winkle."
```
]

---
## Split strings

Get a character matrix, padding / collapsing excess fields.


```r
str_split_fixed(dinner, " ", 4)
```

```
##      [,1]     [,2] [,3]      [,4]               
## [1,] "Jeff"   "B." "demands" "caviar."          
## [2,] "Larry"  "E." "demands" "cake."            
## [3,] "Warren" "B." "demands" "Pappy Van Winkle."
```

```r
str_split_fixed(dinner, " ", 6)
```

```
##      [,1]     [,2] [,3]      [,4]      [,5]  [,6]     
## [1,] "Jeff"   "B." "demands" "caviar." ""    ""       
## [2,] "Larry"  "E." "demands" "cake."   ""    ""       
## [3,] "Warren" "B." "demands" "Pappy"   "Van" "Winkle."
```

---
## Split strings

Get exactly what you ask for.


```r
str_split(dinner, " ")
```

```
## [[1]]
## [1] "Jeff"    "B."      "demands" "caviar."
## 
## [[2]]
## [1] "Larry"   "E."      "demands" "cake."  
## 
## [[3]]
## [1] "Warren"  "B."      "demands" "Pappy"   "Van"     "Winkle."
```

Also recall `tidyr::separate` and `tidyr::separate_rows`.

---
## Other handy low-level string manipulations

* Change case with `str_to_lower()`, `str_to_upper()`, `str_to_title()`
* Remove leading/trailing whitespace with `str_trim()`, or collapse repeated whitespace with `str_squish()`
* Wrap long lines with `stringr::str_wrap()`
* Truncate long strings with `str_trunc()`, or abbreviate them with `base::abbreviate()`

---
## Application Exercise

.hand[Your turn.] Browse to [Lecture 10 - Text - AE](https://rstudio.cloud/spaces/162296/project/2985078) and complete the exercises in text-munge.Rmd.

---
class: middle

.hand[Regular Expressions]: Uff, probably wouldn't have done it this way in hindsight.

---
## Regular expressions

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski (creator of Mozilla)

* Are like *find-replace*, *wildcards* \* ? on 💊 **and** 🍄
* Are found in nearly every computer language
* Can be just the ticket to solving *some* problems

---
class: bg-green
## Syntax

Write what you want to match (if it's alpha-numeric).


```r
lunch = c("one app", "two appetizers", "three apples")
str_view_all(lunch, 'apple')
```
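`str_view_all()` renders its matches as an interactive HTML widget, which won't show on paper. As a plain-text sketch of the same match, `str_extract()` should return the first matching portion of each string (or `NA`):


```r
str_extract(lunch, 'apple')
## [1] NA      NA      "apple"
```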
---
class: bg-green
## Match multiple things: wildcard

`.` is a generic wildcard; it matches any single character.


```r
str_view_all(lunch, 'app.')
```
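As a plain-text sketch: `"one app"` ends right after `app`, so the wildcard has nothing left to match.


```r
str_extract(lunch, 'app.')
## [1] NA     "appe" "appl"
```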
---
class: bg-green
## Match multiple things: character class

`[<set>]` is a character class; it matches any single character in `<set>`. Specify a range of characters with `[<char>-<char>]`. Invert a class with `[^<set>]`.


```r
str_view_all(lunch, 'app[le]')
```
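A plain-text sketch of the class and its inversion:


```r
str_extract(lunch, 'app[le]')   # 'app' followed by 'l' or 'e'
## [1] NA     "appe" "appl"
str_extract(lunch, 'app[^le]')  # 'app' followed by anything EXCEPT 'l' or 'e'
## [1] NA NA NA
```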
---
class: bg-green
## Match multiple things: disjunction

`(<x>|<y>)` is a disjunction, matches `<x>` or `<y>`.


```r
str_view_all(lunch, 'app(le|etizer)s')
```
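A plain-text sketch of the two alternatives matching:


```r
str_extract(lunch, 'app(le|etizer)s')
## [1] NA           "appetizers" "apples"
```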
---
class: bg-green
## Quantifiers modify matches

1. `*` zero or more matches
2. `?` zero or one matches
3. `+` one or more matches
4. `{min,max}` to match between min-max times.

Compare back to `"app."`, which didn't match the first string.


```r
str_view_all(lunch, 'app.*')
```
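A sketch of how quantifiers change the match: with `*`, zero characters after `app` is now acceptable, and greedy matching grabs as much as the quantifier allows.


```r
str_extract(lunch, 'app.*')     # zero or more characters after 'app'
## [1] "app"        "appetizers" "apples"
str_extract(lunch, 'app.{0,2}') # at most two characters after 'app'
## [1] "app"   "appet" "apple"
```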
---
## Match without consuming with zero-width assertions

* `^` matches a zero-width "character" present at the start of all lines.
* `$` is the analogous character at the end
* `\b` is between "words".

For example, the string: `red tired` can be thought of as .darkgreen[^\b]red.darkgreen[\b] .darkgreen[\b]tired.darkgreen[\b$]

---
class: bg-green
## Require word boundary

We must double the `\` to keep R from interpreting it as an escape character.


```r
str_view_all("red tired", "\\bred\\b")
```
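A plain-text sketch: only the standalone word should match -- the `red` inside `tired` fails the boundary test.


```r
str_extract_all("red tired", "\\bred\\b")
## [[1]]
## [1] "red"
```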
---
class: bg-green
## Match unconditionally


```r
str_view_all("red tired", "red")
```
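Without the boundaries, both occurrences match -- a sketch with `str_count()`:


```r
str_count("red tired", "red")  # matches inside both "red" and "tired"
## [1] 2
```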
---
## Using regular expressions

* Test for a match with `str_detect()`.
* Return the first `str_extract()` or all `str_extract_all()` matching portions of a string.
* Return the first `str_match()` or all `str_match_all()` matching portions of a string **and capture groups**.
* Replace the first `str_replace()` or all `str_replace_all()` matching portions of a string and capture groups.

---
## `str_detect()`


```r
str_detect(string = c("A", "AA", "AB", "B"), pattern = "A")
```

```
## [1]  TRUE  TRUE  TRUE FALSE
```

```r
str_detect(string = lunch, pattern = 'app.')
```

```
## [1] FALSE  TRUE  TRUE
```

---
## `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_extract(string = feline, pattern = "cat")
```

```
## [1] "cat" "cat" NA
```

```r
str_extract_all(string = feline, pattern = "cat")
```

```
## [[1]]
## [1] "cat"
## 
## [[2]]
## [1] "cat" "cat"
## 
## [[3]]
## character(0)
```

---
## `str_match()`

For simple queries, behaves like `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "cat")
```

```
##      [,1] 
## [1,] "cat"
## [2,] "cat"
## [3,] NA
```

---
## `str_match()`

But returns **capture groups** `(<expression>)` separately.


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "(\\w*) cat.? (\\w*)")
```

```
##      [,1]           [,2]  [,3]  
## [1,] "of cats goes" "of"  "goes"
## [2,] "the cat) is"  "the" "is"  
## [3,] NA             NA    NA
```

`\w = [A-Za-z0-9_]`; we must double the `\` to keep R from interpreting it as an escape character.

---
## `str_match_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match_all(feline, "(\\w*) cat.? (\\w*)")
```

```
## [[1]]
##      [,1]           [,2] [,3]  
## [1,] "of cats goes" "of" "goes"
## 
## [[2]]
##      [,1]          [,2]  [,3]  
## [1,] "the cat) is" "the" "is"  
## [2,] "a cat with"  "a"   "with"
## 
## [[3]]
##      [,1] [,2] [,3]
```

---
## `str_replace()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."                 
## [2] "Infimum (the murder machine) is a cat with a most baleful meow."
## [3] "Dog."
```

---
## `str_replace_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace_all(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."                            
## [2] "Infimum (the murder machine) is a murder machine with a most baleful meow."
## [3] "Dog."
```

---
## `str_replace()` can also use capture groups

Use `\1` to refer to the first capture group, `\2` for the second, etc. Note the `\\` because `\` must be escaped in R.


```r
str_replace_all(feline, "(\\w*)", "\\1\\1")
```

```
## [1] "TheThe furfur ofof catscats goesgoes byby manymany namesnames."
## [2] "InfimumInfimum (thethe catcat) isis aa catcat withwith aa mostmost balefulbaleful meowmeow."
## [3] "DogDog."
```

Actually, any regular expression can use a capture group, both for matching and replacing!

---
## Application exercise

.hand[Your turn]. .alert[Form pairs<sup>1</sup>], and browse again to [Lecture 10 - Text AE](https://rstudio.cloud/spaces/162296/project/2985078) and complete the exercises in regex-text.
You might also experiment with installing `devtools::install_github("gadenbuie/regexplain")`. See https://www.garrickadenbuie.com/project/regexplain/ for example usage.

.footnote[[1] By this, I mean I highly encourage only one person to code at a time -- the other person should just watch and spout beta.]

---
class: middle

Text mining using `tidytext`

---
## Text mining using `tidytext`

Text is inherently high-dimensional and noisy data. We could spend weeks on this. Instead, we'll have to be content to know what we don't know:

* Sampling text data and its potential ascertainment biases
* Handling non-Roman (non-ASCII) characters
* Parsing into tokens
* Filtering low-content words
* Dimension reduction, e.g., latent Dirichlet allocation or non-negative matrix factorization
* Embeddings using pre-trained neural networks

Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/).

---
## Most important functionality

* `unnest_tokens()` splits a string into tokens (words, bi-grams, etc.) as a data frame
* `bind_tf_idf()` calculates term and inverse-document frequencies
* `cast_sparse()` converts to a (sparse) document-term matrix

---
## Austen vs Kafka


```r
library(tidytext)
book_names = tibble(gutenberg_id = c(158, 1342, 5200, 7849),
                    title = c('Emma', 'Pride and prejudice',
                              'Metamorphosis', 'The Trial'))
books = gutenbergr::gutenberg_download(book_names$gutenberg_id) %>%
  left_join(book_names)
```

.scroll-box-10[

```r
books %>% group_by(title) %>% slice_head(n=6)
```

```
## # A tibble: 24 × 3
## # Groups:   title [4]
##    gutenberg_id text             title        
##           <dbl> <chr>            <chr>        
##  1          158 "Emma"           Emma         
##  2          158 ""               Emma         
##  3          158 "by Jane Austen" Emma         
##  4          158 ""               Emma         
##  5          158 ""               Emma         
##  6          158 "Contents"       Emma         
##  7         5200 "Metamorphosis"  Metamorphosis
##  8         5200 ""               Metamorphosis
## # … with 16 more rows
```
]

---
## Get words


```r
book_words = unnest_tokens(books, text, output = 'word', drop = TRUE)
book_words
```

```
## # A tibble: 389,746 × 3
##    gutenberg_id title word    
##           <dbl> <chr> <chr>   
##  1          158 Emma  emma    
##  2          158 Emma  by      
##  3          158 Emma  jane    
##  4          158 Emma  austen  
##  5          158 Emma  contents
##  6          158 Emma  volume  
##  7          158 Emma  i       
##  8          158 Emma  chapter 
## # … with 389,738 more rows
```

---
## Count words by book


```r
word_counts = book_words %>%
  group_by(title) %>%
  count(title, word) %>%
  arrange(desc(n))
word_counts %>% slice_head(n = 3)
```

```
## # A tibble: 12 × 3
## # Groups:   title [4]
##    title               word      n
##    <chr>               <chr> <int>
##  1 Emma                to     5238
##  2 Emma                the    5201
##  3 Emma                and    4896
##  4 Metamorphosis       the    1148
##  5 Metamorphosis       to      753
##  6 Metamorphosis       and     642
##  7 Pride and prejudice the    4333
##  8 Pride and prejudice to     4163
##  9 Pride and prejudice of     3612
## 10 The Trial           the    4725
## 11 The Trial           to     2855
## 12 The Trial           and    2006
```

---
## Remove "stop" words

Stop words are common, low-semantic-value words. It's sometimes useful to remove them.
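`get_stopwords()` returns a tidy data frame of stop words that we can `anti_join()` against. A quick peek (a sketch -- the default lexicon should be "snowball"):


```r
get_stopwords() %>% head(3)
## # A tibble: 3 × 2
##   word  lexicon 
##   <chr> <chr>   
## 1 i     snowball
## 2 me    snowball
## 3 my    snowball
```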

```r
word_counts %>%
  anti_join(get_stopwords()) %>%
  slice_head(n = 3)
```

```
## # A tibble: 12 × 3
## # Groups:   title [4]
##    title               word          n
##    <chr>               <chr>     <int>
##  1 Emma                mr         1153
##  2 Emma                emma        786
##  3 Emma                mrs         699
##  4 Metamorphosis       gregor      199
##  5 Metamorphosis       room        131
##  6 Metamorphosis       gregor’s     99
##  7 Pride and prejudice mr          784
##  8 Pride and prejudice elizabeth   596
##  9 Pride and prejudice said        402
## 10 The Trial           k          1176
## 11 The Trial           said        770
## 12 The Trial           now         312
```

---
## Term frequency in Kafka vs Austen


```r
total_words = word_counts %>%
  group_by(title) %>%
  summarize(total = sum(n))
word_counts = left_join(word_counts, total_words)
word_counts
```

```
## # A tibble: 21,423 × 4
## # Groups:   title [4]
##    title               word      n  total
##    <chr>               <chr> <int>  <int>
##  1 Emma                to     5238 161113
##  2 Emma                the    5201 161113
##  3 Emma                and    4896 161113
##  4 The Trial           the    4725  84218
##  5 Pride and prejudice the    4333 122359
##  6 Emma                of     4291 161113
##  7 Pride and prejudice to     4163 122359
##  8 Pride and prejudice of     3612 122359
## # … with 21,415 more rows
```

---
## Term frequency in Kafka vs Austen


```r
ggplot(word_counts, aes(n/total)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~title, ncol = 2, scales = "free_y") +
  theme_minimal()
```

<img src="l10-text_files/figure-html/plottf-1.png" width="60%" style="display: block; margin: auto;" />

---
## Zipf's law

Distributions like those on the previous slide are typical in language. A classic version of this relationship is called Zipf's law.

> Zipf's law states that the frequency that a word appears is inversely proportional to its rank.

---
## Zipf's law

.panelset[
.panel[.panel-name[Code]

```r
freq_by_rank = word_counts %>%
  group_by(title) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total) %>%
  ungroup()

freq_by_rank %>%
  ggplot(aes(x = rank, y = `term frequency`, color = title)) +
  geom_abline(intercept = -0.62, slope = -1,
              color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  theme_minimal()
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-43-1.png" width="60%" style="display: block; margin: auto;" />
]
]

---
## Sentiment analysis


```r
word_sentiments = word_counts %>%
* left_join(sentiments) %>%
  filter(!is.na(sentiment)) %>%
  group_by(title) %>%
  mutate(word_collapse = fct_lump_n(word, n = 10, w = n),
         word_collapse = fct_reorder(word_collapse, n, sum)) %>%
  select(title, word_collapse, sentiment, n)
word_sentiments
```

```
## # A tibble: 4,469 × 4
## # Groups:   title [4]
##    title               word_collapse sentiment     n
##    <chr>               <fct>         <chr>     <int>
##  1 Emma                miss          negative    599
##  2 Emma                well          positive    401
##  3 Emma                good          positive    359
##  4 Pride and prejudice miss          negative    283
##  5 Emma                great         positive    264
##  6 Pride and prejudice well          positive    224
##  7 The Trial           like          positive    212
##  8 Emma                like          positive    200
## # … with 4,461 more rows
```

---
## Which is happier?

```r
ggplot(word_sentiments,
       aes(y = fct_reorder(word_collapse, n, .fun = sum),
           x = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~title, scales = 'free_x') +
  ylab("Word") + xlab("Occurrence") +
  theme_minimal()
```

<img src="l10-text_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />

---
## Term frequency and inverse document frequency

The inverse document frequency is

`$$\text{idf}(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

The IDF thus ranges from 0, for words that appear in every document, up to `\(\ln(n_{\text{documents}})\)` for a word that appears in only one document.

The term frequency is just the word counts, normalized to the number of words per text, so the popular TF-IDF<sup>1</sup> metric is just

`$$\text{tf-idf}(\text{term}) = \text{idf}(\text{term}) \times \text{tf}(\text{term})$$`

.footnote[[1] Popular, and curiously devoid of an obvious statistical model. [Some attempts to link to information theory](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf) have been made.]

---
## Calculate TF-IDF


```r
word_counts = word_counts %>%
  bind_tf_idf(word, title, n)
word_counts
```

```
## # A tibble: 21,423 × 7
## # Groups:   title [4]
##    title               word      n  total     tf   idf tf_idf
##    <chr>               <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 Emma                to     5238 161113 0.0325     0      0
##  2 Emma                the    5201 161113 0.0323     0      0
##  3 Emma                and    4896 161113 0.0304     0      0
##  4 The Trial           the    4725  84218 0.0561     0      0
##  5 Pride and prejudice the    4333 122359 0.0354     0      0
##  6 Emma                of     4291 161113 0.0266     0      0
##  7 Pride and prejudice to     4163 122359 0.0340     0      0
##  8 Pride and prejudice of     3612 122359 0.0295     0      0
...
```

---
## TF-IDF of Kafka and Austen

This works relatively well to identify signature words -- some represent content, some represent author style (e.g. contractions used by Kafka).

<img src="l10-text_files/figure-html/unnamed-chunk-46-1.png" width="60%" style="display: block; margin: auto;" />

---
## Occurrence matrix

Lastly, we might want to convert our counts to an occurrence matrix `\(\mathbf X = [x_{ij}]\)`, where `\(x_{ij}\)` is the number of times document `\(i\)` contains term `\(j\)`. Most `\(x_{ij}\)` will be zero, reflecting Zipf's law.

We will almost always want to store it in a special format called a .alert[sparse matrix], which only stores the non-zero entries and their index in the matrix.

---
## `cast_sparse()`


```r
X = cast_sparse(word_counts, title, word, n)
class(X)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

```r
dim(X)
```

```
## [1]     4 11754
```

```r
sum(X>0)
```

```
## [1] 21423
```

This is useful for downstream modeling, such as latent Dirichlet allocation -- see the appendix slide for a sketch.

---
# Resources

Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/).
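---
## Appendix: a topic-modeling sketch

A minimal sketch of the downstream modeling step mentioned earlier, assuming the `topicmodels` package is installed. `topicmodels::LDA()` expects a `DocumentTermMatrix` rather than a `dgCMatrix`, so we use tidytext's `cast_dtm()` instead of `cast_sparse()`. With only four books, this is purely illustrative.


```r
library(topicmodels)
# cast counts to a DocumentTermMatrix (one row per book)
dtm = cast_dtm(word_counts, title, word, n)
# fit a two-topic LDA model; seed fixed for reproducibility
lda = LDA(dtm, k = 2, control = list(seed = 1234))
# per-topic word probabilities (beta), top 5 words per topic
tidy(lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5)
```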