class: ur-title, center, middle, title-slide

# BST430 Lecture 10

## Text Processing

### Andrew McDavid

### U of Rochester

### 2021-09-26 (updated: 2021-10-06)

---
## Cardiac diagnoses


```r
diagnoses = read_csv('l10/data/cardiac-dx.csv')
diagnoses
```

```
## # A tibble: 100 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d} | abdominal situs ambiguous (abdominal he…
##  2 10949 M      {s,d,l} | aortic stenosis - valvar | atrial septa…
##  3  8090 F      {s,l,l} | aortic valve position relative to the p…
##  4 19800 M      {s,l,l} | crisscross atrioventricular valves | de…
##  5  2708 M      aberrant left subclavian artery | hypoplastic mai…
##  6 14031 F      absence of the suprarenal inferior vena cava with…
##  7  4185 M      aortic arch hypoplasia | aortic atresia | coronar…
##  8   313 F      aortic arch hypoplasia | aortic atresia | hypopla…
## # … with 92 more rows
```

.question[How to get all the diagnoses of hypoplastic left heart?]

---
## Attempt 1


```r
diagnoses %>% count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 67 × 2
##   diagnoses                                                     n
##   <chr>                                                     <int>
## 1 screener diagnosis: none                                     26
## 2 hypoplastic left heart syndrome | screener diagnosis: no…     5
## 3 atrial septal defect, secundum | screener diagnosis: atr…     3
## 4 aortic atresia | hypoplastic left heart syndrome | mitra…     2
## 5 screener diagnosis: none | tetralogy of fallot                2
## 6 {a,d,d} | abdominal situs ambiguous (abdominal heterotax…     1
## 7 {s,d,l} | aortic stenosis - valvar | atrial septal defec…     1
## 8 {s,l,l} | aortic valve position relative to the pulmonar…     1
## # … with 59 more rows
```

--

Too many unique diagnoses to make headway with this.

---
## Using string matching


```r
filter(diagnoses, str_detect(diagnoses, 'hypoplastic'))
```

```
## # A tibble: 50 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d} | abdominal situs ambiguous (abdominal he…
##  2 10949 M      {s,d,l} | aortic stenosis - valvar | atrial septa…
##  3 19800 M      {s,l,l} | crisscross atrioventricular valves | de…
##  4  2708 M      aberrant left subclavian artery | hypoplastic mai…
##  5 14031 F      absence of the suprarenal inferior vena cava with…
##  6  4185 M      aortic arch hypoplasia | aortic atresia | coronar…
##  7   313 F      aortic arch hypoplasia | aortic atresia | hypopla…
##  8 10792 M      aortic arch hypoplasia | aortic atresia | hypopla…
## # … with 42 more rows
```

.question[What different sorts of hypoplasia are there?]

---
## Split into pieces


```r
diagnoses_row = diagnoses %>%
  tidyr::separate_rows(diagnoses,
*   sep = " \\| ") #WTH?
diagnoses_row
```

```
## # A tibble: 412 × 3
##       id gender diagnoses                                         
##    <dbl> <chr>  <chr>                                             
##  1 26108 F      {a,d,d}                                           
##  2 26108 F      abdominal situs ambiguous (abdominal heterotaxy)  
##  3 26108 F      aortic atresia                                    
##  4 26108 F      double outlet right ventricle                     
##  5 26108 F      heterotaxy syndrome                               
##  6 26108 F      inferior vena cava, left sided                    
##  7 26108 F      screener diagnosis: other: cavc, aa, hypoplastic …
...
```

---
## Split into pieces


```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  count(diagnoses) %>%
  arrange(desc(n))
```

```
## # A tibble: 15 × 2
##   diagnoses                                                  n
##   <chr>                                                  <int>
## 1 hypoplastic left heart syndrome                           21
## 2 hypoplastic right ventricle (subnormal cavity volume)      8
## 3 hypoplastic left ventricle (subnormal cavity volume)       7
## 4 hypoplastic mitral valve                                   7
## 5 hypoplastic left pulmonary artery                          6
## 6 hypoplastic main pulmonary artery                          6
## 7 hypoplastic right pulmonary artery                         5
## 8 hypoplastic tricuspid valve                                3
## # … with 7 more rows
```

---
## Plot co-occurrence

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  ggplot(aes(y = diagnoses, x = as.factor(id))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Sensible factor orders

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Wrap text width

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
* mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-10-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
### Adjust text size and justification

.panelset[
.panel[.panel-name[Code]

```r
diagnoses_row %>%
  filter(str_detect(diagnoses, 'hypoplastic')) %>%
  mutate(diagnoses = str_wrap(diagnoses, width = 40)) %>%
  ggplot(aes(y = fct_infreq(diagnoses), x = fct_infreq(as.factor(id)))) +
  geom_tile() +
* theme(axis.text.y = element_text(hjust = 0, vjust = 0, size = 8)) +
  scale_x_discrete(breaks = NULL) +
  labs(y = "Diagnosis", x = 'Patient',
       title = 'Co-occurrence of hypoplastic heart disorders')
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-11-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---
# Topics

* Why text processing
* Low-level processing:
  * concatenate, count characters, substring, split strings.
* Regular expressions (aka regex):
  * detect, extract, replace
* Text mining
  * Tokenizing, filtering, analysis

---
## Low-level text processing

* concatenate with `stringr::str_c()` and `glue::glue()`
* count characters with `base::nchar()`
* extract and replace substrings with `stringr::str_sub()`
* split with `str_split_fixed()` (generally) or `str_split()` (less often)

---
## Packages

.pull-left[
`stringr` and `glue` rationalize much of text processing, which is otherwise a bit of a thicket in R.
]

---
## Concatenate strings

.pull-left[

```r
names = c("Jeff B.", "Larry E.", "Warren B.")
favorite_food = c("caviar", "cake", "Pappy Van Winkle")
str_c(names, " likes ", #note additional spaces
      favorite_food, ".")
```

```
## [1] "Jeff B. likes caviar."            
## [2] "Larry E. likes cake."             
## [3] "Warren B. likes Pappy Van Winkle."
```
]
.pull-right[

```r
dinner = glue::glue("{names} likes {favorite_food}.")
dinner
```

```
## Jeff B. likes caviar.
## Larry E. likes cake.
## Warren B. likes Pappy Van Winkle.
```
]

---
## Some special characters

* \n newline
* \r carriage return
* \t tab
* \f form feed
* \Unnnnnnnn Unicode character with given code
* \\\ literal backslash (.alert[this one will prove to be especially annoying...])
* \" literal quote

Others are listed in `?'"'` (the help page on quotes).

---
## Glue with newlines and unicode


```r
glue::glue("{names} \n {favorite_food} \U1F600.")
```

```
## Jeff B. 
##  caviar 😀.
## Larry E. 
##  cake 😀.
## Warren B. 
##  Pappy Van Winkle 😀.
```

---
## Count characters


```r
names
```

```
## [1] "Jeff B."   "Larry E."  "Warren B."
```

```r
nchar(names)
```

```
## [1] 7 8 9
```

---
## Extract substrings

.pull-left[
Extract

```r
str_sub(dinner, 1, 11)
```

```
## [1] "Jeff B. lik" "Larry E. li" "Warren B. l"
```
]
.pull-right[
Replace

```r
str_sub(dinner,
        nchar(names) + 2, # start of "likes" (after the space)
        nchar(names) + 6  # end of "likes"
) = "demands"
dinner
```

```
## [1] "Jeff B. demands caviar."            
## [2] "Larry E. demands cake."             
## [3] "Warren B. demands Pappy Van Winkle."
```
]

---
## Split strings

Get a character matrix, padding / collapsing excess fields.


```r
str_split_fixed(dinner, " ", 4)
```

```
##      [,1]     [,2] [,3]      [,4]               
## [1,] "Jeff"   "B." "demands" "caviar."          
## [2,] "Larry"  "E." "demands" "cake."            
## [3,] "Warren" "B." "demands" "Pappy Van Winkle."
```

```r
str_split_fixed(dinner, " ", 6)
```

```
##      [,1]     [,2] [,3]      [,4]      [,5]  [,6]     
## [1,] "Jeff"   "B." "demands" "caviar." ""    ""       
## [2,] "Larry"  "E." "demands" "cake."   ""    ""       
## [3,] "Warren" "B." "demands" "Pappy"   "Van" "Winkle."
```

---
## Split strings

Get exactly what you ask for.


```r
str_split(dinner, " ")
```

```
## [[1]]
## [1] "Jeff"    "B."      "demands" "caviar."
## 
## [[2]]
## [1] "Larry"   "E."      "demands" "cake."  
## 
## [[3]]
## [1] "Warren"  "B."      "demands" "Pappy"   "Van"     "Winkle."
```

Also recall `tidyr::separate` and `tidyr::separate_rows`.

---
## Other handy low-level string manipulations

* Change case with `str_to_lower()`, `str_to_upper()`, `str_to_title()`
* Remove leading/trailing whitespace with `str_trim()`, or collapse repeated whitespace with `str_squish()`
* Wrap long lines with `stringr::str_wrap()`
* Truncate long strings with `str_trunc()`, or abbreviate them with `base::abbreviate()`

---
## Application Exercise

.hand[Your turn.] Browse to [Lecture 10 - Text - AE](https://rstudio.cloud/spaces/162296/project/2985078) and complete the exercises in text-munge.Rmd.

---
class: middle

.hand[Regular Expressions]: Uff, probably wouldn't have done it this way in hindsight.

---
## Regular expressions

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski (creator of Mozilla)

* Are like *find-replace*, *wildcards* \* ? on 💊 **and** 🍄
* Are found in nearly every computer language
* Can be just the ticket to solving *some* problems

---
class: bg-green
## Syntax

Write what you want to match (if it's alpha-numeric).


```r
lunch = c("one app", "two appetizers", "three apples")
str_view_all(lunch, 'apple')
```
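`str_view_all()` renders its matches as an interactive HTML widget, which won't show on paper. As a plain-text sketch of the same match, `str_extract()` should return the first matching portion of each string (or `NA`):


```r
str_extract(lunch, 'apple')
## [1] NA      NA      "apple"
```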
---
class: bg-green
## Match multiple things: wildcard

`.` is a generic wildcard; it matches any single character.


```r
str_view_all(lunch, 'app.')
```
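As a plain-text sketch: `"one app"` ends right after `app`, so the wildcard has nothing left to match.


```r
str_extract(lunch, 'app.')
## [1] NA     "appe" "appl"
```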
---
class: bg-green
## Match multiple things: character class

`[<set>]` is a character class; it matches any single character in `<set>`. Specify a range of characters with `[<char>-<char>]`. Invert a class with `[^<set>]`.


```r
str_view_all(lunch, 'app[le]')
```
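A plain-text sketch of the class and its inversion:


```r
str_extract(lunch, 'app[le]')   # 'app' followed by 'l' or 'e'
## [1] NA     "appe" "appl"
str_extract(lunch, 'app[^le]')  # 'app' followed by anything EXCEPT 'l' or 'e'
## [1] NA NA NA
```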
---
class: bg-green
## Match multiple things: disjunction

`(<x>|<y>)` is a disjunction, matches `<x>` or `<y>`.


```r
str_view_all(lunch, 'app(le|etizer)s')
```
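A plain-text sketch of the two alternatives matching:


```r
str_extract(lunch, 'app(le|etizer)s')
## [1] NA           "appetizers" "apples"
```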
---
class: bg-green
## Quantifiers modify matches

1. `*` zero or more matches
2. `?` zero or one matches
3. `+` one or more matches
4. `{min,max}` to match between min-max times.

Compare back to `"app."`, which didn't match the first string.


```r
str_view_all(lunch, 'app.*')
```
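A sketch of how quantifiers change the match: with `*`, zero characters after `app` is now acceptable, and greedy matching grabs as much as the quantifier allows.


```r
str_extract(lunch, 'app.*')     # zero or more characters after 'app'
## [1] "app"        "appetizers" "apples"
str_extract(lunch, 'app.{0,2}') # at most two characters after 'app'
## [1] "app"   "appet" "apple"
```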
---
## Match without consuming with zero-width assertions

* `^` matches a zero-width "character" present at the start of all lines.
* `$` is the analogous character at the end
* `\b` is between "words".

For example, the string: `red tired` can be thought of as .darkgreen[^\b]red.darkgreen[\b] .darkgreen[\b]tired.darkgreen[\b$]

---
class: bg-green
## Require word boundary

We must double the `\` to keep R from interpreting it as an escape character.


```r
str_view_all("red tired", "\\bred\\b")
```
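A plain-text sketch: only the standalone word should match -- the `red` inside `tired` fails the boundary test.


```r
str_extract_all("red tired", "\\bred\\b")
## [[1]]
## [1] "red"
```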
---
class: bg-green
## Match unconditionally


```r
str_view_all("red tired", "red")
```
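Without the boundaries, both occurrences match -- a sketch with `str_count()`:


```r
str_count("red tired", "red")  # matches inside both "red" and "tired"
## [1] 2
```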
---
## Using regular expressions

* Test for a match with `str_detect()`.
* Return the first `str_extract()` or all `str_extract_all()` matching portions of a string.
* Return the first `str_match()` or all `str_match_all()` matching portions of a string **and capture groups**.
* Replace the first `str_replace()` or all `str_replace_all()` matching portions of a string and capture groups.

---
## `str_detect()`


```r
str_detect(string = c("A", "AA", "AB", "B"), pattern = "A")
```

```
## [1]  TRUE  TRUE  TRUE FALSE
```

```r
str_detect(string = lunch, pattern = 'app.')
```

```
## [1] FALSE  TRUE  TRUE
```

---
## `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_extract(string = feline, pattern = "cat")
```

```
## [1] "cat" "cat" NA
```

```r
str_extract_all(string = feline, pattern = "cat")
```

```
## [[1]]
## [1] "cat"
## 
## [[2]]
## [1] "cat" "cat"
## 
## [[3]]
## character(0)
```

---
## `str_match()`

For simple queries, behaves like `str_extract()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "cat")
```

```
##      [,1] 
## [1,] "cat"
## [2,] "cat"
## [3,] NA
```

---
## `str_match()`

But returns **capture groups** `(<expression>)` separately.


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match(feline, "(\\w*) cat.? (\\w*)")
```

```
##      [,1]           [,2]  [,3]  
## [1,] "of cats goes" "of"  "goes"
## [2,] "the cat) is"  "the" "is"  
## [3,] NA             NA    NA
```

`\w = [A-Za-z0-9_]`; we must double the `\` to keep R from interpreting it as an escape character.

---
## `str_match_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_match_all(feline, "(\\w*) cat.? (\\w*)")
```

```
## [[1]]
##      [,1]           [,2] [,3]  
## [1,] "of cats goes" "of" "goes"
## 
## [[2]]
##      [,1]          [,2]  [,3]  
## [1,] "the cat) is" "the" "is"  
## [2,] "a cat with"  "a"   "with"
## 
## [[3]]
##      [,1] [,2] [,3]
```

---
## `str_replace()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."                 
## [2] "Infimum (the murder machine) is a cat with a most baleful meow."
## [3] "Dog."
```

---
## `str_replace_all()`


```r
feline = c("The fur of cats goes by many names.",
           "Infimum (the cat) is a cat with a most baleful meow.",
           "Dog.")
str_replace_all(feline, "cat", "murder machine")
```

```
## [1] "The fur of murder machines goes by many names."                            
## [2] "Infimum (the murder machine) is a murder machine with a most baleful meow."
## [3] "Dog."
```

---
## `str_replace()` can also use capture groups

Use `\1` to refer to the first capture group, `\2` for the second, etc. Note the `\\` because `\` must be escaped in R.


```r
str_replace_all(feline, "(\\w*)", "\\1\\1")
```

```
## [1] "TheThe furfur ofof catscats goesgoes byby manymany namesnames."
## [2] "InfimumInfimum (thethe catcat) isis aa catcat withwith aa mostmost balefulbaleful meowmeow."
## [3] "DogDog."
```

Actually, any regular expression can use a capture group, both for matching and replacing!

---
## Application exercise

.hand[Your turn]. .alert[Form pairs<sup>1</sup>], and browse again to [Lecture 10 - Text AE](https://rstudio.cloud/spaces/162296/project/2985078) and complete the exercises in regex-text.
You might also experiment with installing `devtools::install_github("gadenbuie/regexplain")`. See https://www.garrickadenbuie.com/project/regexplain/ for example usage.

.footnote[[1] By this, I mean I highly encourage only one person to code at a time -- the other person should just watch and spout beta.]

---
class: middle

Text mining using `tidytext`

---
## Text mining using `tidytext`

Text is inherently high-dimensional and noisy data. We could spend weeks on this. Instead, we'll have to be content to know what we don't know:

* Sampling text data and its potential ascertainment biases
* Handling non-Roman (non-ASCII) characters
* Parsing into tokens
* Filtering low-content words
* Dimension reduction, e.g., latent Dirichlet allocation or non-negative matrix factorization
* Embeddings using pre-trained neural networks

Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/).

---
## Most important functionality

* `unnest_tokens()` splits a string into tokens (words, bi-grams, etc.) as a data frame
* `bind_tf_idf()` calculates term and inverse-document frequencies
* `cast_sparse()` converts to a (sparse) document-term matrix

---
## Austen vs Kafka


```r
library(tidytext)
book_names = tibble(gutenberg_id = c(158, 1342, 5200, 7849),
                    title = c('Emma', 'Pride and prejudice',
                              'Metamorphosis', 'The Trial'))
books = gutenbergr::gutenberg_download(book_names$gutenberg_id) %>%
  left_join(book_names)
```

.scroll-box-10[

```r
books %>% group_by(title) %>% slice_head(n=6)
```

```
## # A tibble: 24 × 3
## # Groups:   title [4]
##    gutenberg_id text             title        
##           <dbl> <chr>            <chr>        
##  1          158 "Emma"           Emma         
##  2          158 ""               Emma         
##  3          158 "by Jane Austen" Emma         
##  4          158 ""               Emma         
##  5          158 ""               Emma         
##  6          158 "Contents"       Emma         
##  7         5200 "Metamorphosis"  Metamorphosis
##  8         5200 ""               Metamorphosis
## # … with 16 more rows
```
]

---
## Get words


```r
book_words = unnest_tokens(books, text, output = 'word', drop = TRUE)
book_words
```

```
## # A tibble: 389,746 × 3
##    gutenberg_id title word    
##           <dbl> <chr> <chr>   
##  1          158 Emma  emma    
##  2          158 Emma  by      
##  3          158 Emma  jane    
##  4          158 Emma  austen  
##  5          158 Emma  contents
##  6          158 Emma  volume  
##  7          158 Emma  i       
##  8          158 Emma  chapter 
## # … with 389,738 more rows
```

---
## Count words by book


```r
word_counts = book_words %>%
  group_by(title) %>%
  count(title, word) %>%
  arrange(desc(n))
word_counts %>% slice_head(n = 3)
```

```
## # A tibble: 12 × 3
## # Groups:   title [4]
##    title               word      n
##    <chr>               <chr> <int>
##  1 Emma                to     5238
##  2 Emma                the    5201
##  3 Emma                and    4896
##  4 Metamorphosis       the    1148
##  5 Metamorphosis       to      753
##  6 Metamorphosis       and     642
##  7 Pride and prejudice the    4333
##  8 Pride and prejudice to     4163
##  9 Pride and prejudice of     3612
## 10 The Trial           the    4725
## 11 The Trial           to     2855
## 12 The Trial           and    2006
```

---
## Remove "stop" words

Stop words are common, low-semantic-value words. It's sometimes useful to remove them.
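`get_stopwords()` returns a tidy data frame of stop words that we can `anti_join()` against. A quick peek (a sketch -- the default lexicon should be "snowball"):


```r
get_stopwords() %>% head(3)
## # A tibble: 3 × 2
##   word  lexicon 
##   <chr> <chr>   
## 1 i     snowball
## 2 me    snowball
## 3 my    snowball
```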

```r
word_counts %>%
  anti_join(get_stopwords()) %>%
  slice_head(n = 3)
```

```
## # A tibble: 12 × 3
## # Groups:   title [4]
##    title               word          n
##    <chr>               <chr>     <int>
##  1 Emma                mr         1153
##  2 Emma                emma        786
##  3 Emma                mrs         699
##  4 Metamorphosis       gregor      199
##  5 Metamorphosis       room        131
##  6 Metamorphosis       gregor’s     99
##  7 Pride and prejudice mr          784
##  8 Pride and prejudice elizabeth   596
##  9 Pride and prejudice said        402
## 10 The Trial           k          1176
## 11 The Trial           said        770
## 12 The Trial           now         312
```

---
## Term frequency in Kafka vs Austen


```r
total_words = word_counts %>%
  group_by(title) %>%
  summarize(total = sum(n))
word_counts = left_join(word_counts, total_words)
word_counts
```

```
## # A tibble: 21,423 × 4
## # Groups:   title [4]
##    title               word      n  total
##    <chr>               <chr> <int>  <int>
##  1 Emma                to     5238 161113
##  2 Emma                the    5201 161113
##  3 Emma                and    4896 161113
##  4 The Trial           the    4725  84218
##  5 Pride and prejudice the    4333 122359
##  6 Emma                of     4291 161113
##  7 Pride and prejudice to     4163 122359
##  8 Pride and prejudice of     3612 122359
## # … with 21,415 more rows
```

---
## Term frequency in Kafka vs Austen


```r
ggplot(word_counts, aes(n/total)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~title, ncol = 2, scales = "free_y") +
  theme_minimal()
```

<img src="l10-text_files/figure-html/plottf-1.png" width="60%" style="display: block; margin: auto;" />

---
## Zipf's law

Distributions like those on the previous slide are typical in language. A classic version of this relationship is called Zipf's law.

> Zipf's law states that the frequency that a word appears is inversely proportional to its rank.

---
## Zipf's law

.panelset[
.panel[.panel-name[Code]

```r
freq_by_rank = word_counts %>%
  group_by(title) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total) %>%
  ungroup()

freq_by_rank %>%
  ggplot(aes(x = rank, y = `term frequency`, color = title)) +
  geom_abline(intercept = -0.62, slope = -1,
              color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  theme_minimal()
```
]
.panel[.panel-name[Plot]
<img src="l10-text_files/figure-html/unnamed-chunk-43-1.png" width="60%" style="display: block; margin: auto;" />
]
]

---
## Sentiment analysis


```r
word_sentiments = word_counts %>%
* left_join(sentiments) %>%
  filter(!is.na(sentiment)) %>%
  group_by(title) %>%
  mutate(word_collapse = fct_lump_n(word, n = 10, w = n),
         word_collapse = fct_reorder(word_collapse, n, sum)) %>%
  select(title, word_collapse, sentiment, n)
word_sentiments
```

```
## # A tibble: 4,469 × 4
## # Groups:   title [4]
##    title               word_collapse sentiment     n
##    <chr>               <fct>         <chr>     <int>
##  1 Emma                miss          negative    599
##  2 Emma                well          positive    401
##  3 Emma                good          positive    359
##  4 Pride and prejudice miss          negative    283
##  5 Emma                great         positive    264
##  6 Pride and prejudice well          positive    224
##  7 The Trial           like          positive    212
##  8 Emma                like          positive    200
## # … with 4,461 more rows
```

---
## Which is happier?

```r
ggplot(word_sentiments,
       aes(y = fct_reorder(word_collapse, n, .fun = sum),
           x = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~title, scales = 'free_x') +
  ylab("Word") + xlab("Occurrence") +
  theme_minimal()
```

<img src="l10-text_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />

---
## Term frequency and inverse document frequency

The inverse document frequency is

`$$\text{idf}(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

The IDF thus ranges from 0, for words that appear in every document, up to `\(\ln(n_{\text{documents}})\)` for a word that appears in only one document.

The term frequency is just the word counts, normalized to the number of words per text, so the popular TF-IDF<sup>1</sup> metric is just

`$$\text{tf-idf}(\text{term}) = \text{idf}(\text{term}) \times \text{tf}(\text{term})$$`

.footnote[[1] Popular, and curiously devoid of an obvious statistical model. [Some attempts to link to information theory](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf) have been made.]

---
## Calculate TF-IDF


```r
word_counts = word_counts %>%
  bind_tf_idf(word, title, n)
word_counts
```

```
## # A tibble: 21,423 × 7
## # Groups:   title [4]
##    title               word      n  total     tf   idf tf_idf
##    <chr>               <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 Emma                to     5238 161113 0.0325     0      0
##  2 Emma                the    5201 161113 0.0323     0      0
##  3 Emma                and    4896 161113 0.0304     0      0
##  4 The Trial           the    4725  84218 0.0561     0      0
##  5 Pride and prejudice the    4333 122359 0.0354     0      0
##  6 Emma                of     4291 161113 0.0266     0      0
##  7 Pride and prejudice to     4163 122359 0.0340     0      0
##  8 Pride and prejudice of     3612 122359 0.0295     0      0
...
```

---
## TF-IDF of Kafka and Austen

This works relatively well to identify signature words -- some represent content, some represent author style (e.g. contractions used by Kafka).

<img src="l10-text_files/figure-html/unnamed-chunk-46-1.png" width="60%" style="display: block; margin: auto;" />

---
## Occurrence matrix

Lastly, we might want to convert our counts to an occurrence matrix `\(\mathbf X = [x_{ij}]\)`, where `\(x_{ij}\)` is the number of times document `\(i\)` contains term `\(j\)`. Most `\(x_{ij}\)` will be zero, reflecting Zipf's law.

We will almost always want to store it in a special format called a .alert[sparse matrix], which only stores the non-zero entries and their index in the matrix.

---
## `cast_sparse()`


```r
X = cast_sparse(word_counts, title, word, n)
class(X)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

```r
dim(X)
```

```
## [1]     4 11754
```

```r
sum(X>0)
```

```
## [1] 21423
```

This is useful for downstream modeling, such as latent Dirichlet allocation -- see the appendix slide for a sketch.

---
# Resources

Julia Silge has [one book on classical text mining](https://www.tidytextmining.com/) and [another on machine learning on text](https://smltar.com/).
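---
## Appendix: a topic-modeling sketch

A minimal sketch of the downstream modeling step mentioned earlier, assuming the `topicmodels` package is installed. `topicmodels::LDA()` expects a `DocumentTermMatrix` rather than a `dgCMatrix`, so we use tidytext's `cast_dtm()` instead of `cast_sparse()`. With only four books, this is purely illustrative.


```r
library(topicmodels)
# cast counts to a DocumentTermMatrix (one row per book)
dtm = cast_dtm(word_counts, title, word, n)
# fit a two-topic LDA model; seed fixed for reproducibility
lda = LDA(dtm, k = 2, control = list(seed = 1234))
# per-topic word probabilities (beta), top 5 words per topic
tidy(lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5)
```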