2 Week 2: Tokenization and Word Frequency
Slides
- 3 Tokenization and Word Frequency (link or in Perusall)
2.1 Setup
As always, we first load the packages that we’ll be using:
library(tidyverse) # for wrangling data
library(tidylog) # to know what we are wrangling
library(tidytext) # for 'tidy' manipulation of text data
library(quanteda) # tokenization power house
library(quanteda.textstats)
library(quanteda.textplots)
library(wesanderson) # to prettify
library(readxl) # to read excel
library(kableExtra) # for displaying data in html format (relevant for formatting this worksheet mainly)
2.2 Get Data
For this example, we will be using a small corpus of song lyrics.
sample_lyrics <- read_excel("data/lyrics_sample.xlsx")
head(sample_lyrics)
## # A tibble: 6 × 5
## artist album year song lyrics
## <chr> <chr> <dbl> <chr> <chr>
## 1 Rage Against the Machine Evil Empire 1996 Bulls on Parade "Come…
## 2 Rage Against the Machine Rage Against the Machine 1992 Killing in the Name "Kill…
## 3 Rage Against the Machine Renegades 2000 Renegades of Funk "No m…
## 4 Rage Against the Machine The Battle of Los Angeles 1999 Sleep Now in the Fi… "Yeah…
## 5 Rage Against the Machine The Battle of Los Angeles 1999 Guerrilla Radio "Tran…
## 6 Rage Against the Machine The Battle of Los Angeles 1999 Testify "Uh!\…
Ok, so we have different artists, from different genres and years…
##
## Megan Thee Stallion Rage Against the Machine System of a Down
## 5 6 5
## Taylor Swift
## 5
And we have the lyrics in the following form:
## [1] "Yeah\r\n\r\nThe world is my expense\r\nIt’s the cost of my desire\r\nJesus blessed me with its future\r\nAnd I protect it with fire\r\n\r\nSo raise your fists and march around\r\nJust don’t take what you need\r\nI’ll jail and bury those committed\r\nAnd smother the rest in greed\r\n\r\nCrawl with me into tomorrow\r\nOr I’ll drag you to your grave\r\nI’m deep inside your children\r\nThey’ll betray you in my name\r\n\r\nHey, hey, sleep now in the fire\r\nHey, hey, sleep now in the fire\r\n\r\nThe lie is my expense\r\nThe scope of my desire\r\nThe party blessed me with its future\r\nAnd I protect it with fire\r\n\r\nI am the Niña, the Pinta, the Santa María\r\nThe noose and the rapist, the fields overseer\r\nThe Agents of Orange, the Priests of Hiroshima\r\nThe cost of my desire, sleep now in the fire\r\n\r\nHey, hey, sleep now in the fire\r\nHey, hey, sleep now in the fire\r\n\r\nFor it’s the end of history\r\nIt’s caged and frozen still\r\nThere is no other pill to take\r\nSo swallow the one that made you ill\r\n\r\nThe Niña, the Pinta, the Santa María\r\nThe noose and the rapist, the fields overseer\r\nThe Agents of Orange, the Priests of Hiroshima\r\nThe cost of my desire to sleep now in the fire\r\n\r\nYeah\r\n\r\nSleep now in the fire\r\nSleep now in the fire\r\nSleep now in the fire\r\nSleep now in the fire"
2.3 Cleaning the Text
Much like music, text comes in different forms and qualities. From the Regex workshop, you might remember that special characters can signal, for example, a new line (\n) or a carriage return (\r). For this example, we can remove them. Before working with text, always check the state of your documents once they are loaded into your program of choice.
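For instance, a quick illustrative check with stringr (loaded with the tidyverse) confirms that the \r\n sequences really are present in the raw lyrics:
# Count the carriage-return + newline pairs in each raw lyric (first six songs)
str_count(sample_lyrics$lyrics, "\\r\\n") %>% head()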
sample_lyrics <- sample_lyrics %>%
# Replace newline characters (\n) with a period.
# Note: "\\n" matches the literal newline escape sequence in the string.
mutate(
lyrics_clean = str_replace_all(lyrics, "\\n", "\\."),
# Replace carriage returns (\r) with a period as well.
lyrics_clean = str_replace_all(lyrics_clean, "\\r", "\\.")
) %>%
# Drop the original lyrics column to avoid keeping both raw and cleaned versions
select(-lyrics)
## mutate: new variable 'lyrics_clean' (character) with 21 unique values and 0% NA
## select: dropped one variable (lyrics)
# Inspect the 4th cleaned lyric to confirm the replacements worked as intended
sample_lyrics$lyrics_clean[4]
## [1] "Yeah....The world is my expense..It’s the cost of my desire..Jesus blessed me with its future..And I protect it with fire....So raise your fists and march around..Just don’t take what you need..I’ll jail and bury those committed..And smother the rest in greed....Crawl with me into tomorrow..Or I’ll drag you to your grave..I’m deep inside your children..They’ll betray you in my name....Hey, hey, sleep now in the fire..Hey, hey, sleep now in the fire....The lie is my expense..The scope of my desire..The party blessed me with its future..And I protect it with fire....I am the Niña, the Pinta, the Santa María..The noose and the rapist, the fields overseer..The Agents of Orange, the Priests of Hiroshima..The cost of my desire, sleep now in the fire....Hey, hey, sleep now in the fire..Hey, hey, sleep now in the fire....For it’s the end of history..It’s caged and frozen still..There is no other pill to take..So swallow the one that made you ill....The Niña, the Pinta, the Santa María..The noose and the rapist, the fields overseer..The Agents of Orange, the Priests of Hiroshima..The cost of my desire to sleep now in the fire....Yeah....Sleep now in the fire..Sleep now in the fire..Sleep now in the fire..Sleep now in the fire"
2.4 Tokenization
Our goal is to create a document-feature matrix, from which we will later extract information about word frequency. To do that, we start by creating a corpus object using the quanteda package.
# Create a quanteda corpus from the cleaned lyrics data frame.
# - text_field specifies which column contains the text to be treated as documents.
# - unique_docnames ensures each document gets a unique ID (useful when rows might share names/IDs).
corpus_lyrics <- corpus(
sample_lyrics,
text_field = "lyrics_clean",
unique_docnames = TRUE
)
# Quick overview of the corpus (number of documents, tokens, etc.)
summary(corpus_lyrics)
## Corpus consisting of 21 documents, showing 21 documents:
##
## Text Types Tokens Sentences artist album
## text1 119 375 35 Rage Against the Machine Evil Empire
## text2 52 853 83 Rage Against the Machine Rage Against the Machine
## text3 188 835 91 Rage Against the Machine Renegades
## text4 97 352 38 Rage Against the Machine The Battle of Los Angeles
## text5 160 440 50 Rage Against the Machine The Battle of Los Angeles
## text6 133 535 67 Rage Against the Machine The Battle of Los Angeles
## text7 105 560 53 System of a Down Mezmerize
## text8 67 366 40 System of a Down Toxicity
## text9 68 298 33 System of a Down Toxicity
## text10 65 258 32 System of a Down Toxicity
## text11 137 558 68 System of a Down Toxicity
## text12 131 876 70 Taylor Swift 1989
## text13 159 465 41 Taylor Swift Midnights
## text14 162 544 62 Taylor Swift Fearless
## text15 196 738 84 Taylor Swift 1989
## text16 169 549 50 Taylor Swift Fearless
## text17 229 867 55 Megan Thee Stallion Traumazine
## text18 193 664 61 Megan Thee Stallion Suga
## text19 310 1190 87 Megan Thee Stallion Something for Thee Hotties
## text20 198 656 48 Megan Thee Stallion Traumazine
## text21 255 1092 73 Megan Thee Stallion Traumazine
## year song
## 1996 Bulls on Parade
## 1992 Killing in the Name
## 2000 Renegades of Funk
## 1999 Sleep Now in the Fire
## 1999 Guerrilla Radio
## 1999 Testify
## 2005 B.Y.O.B
## 2001 Chop Suey!
## 2001 Aerials
## 2001 Toxicty
## 2001 Sugar
## 2014 Shake it Off
## 2022 Anti-Hero
## 2008 You Belong With Me
## 2014 Blank Space
## 2008 Love Story
## 2022 Plan B
## 2020 Savage
## 2021 Thot Shit
## 2022 Her
## 2022 Ungrateful
Looks good. Now we can tokenize our corpus (and reduce complexity). One benefit of creating a corpus object first is that it preserves all the metadata for each document when we tokenize. This will come in handy later.
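Because the corpus keeps those document variables attached, you can inspect the metadata at any point with docvars() (shown here purely as an illustration):
# Look at the document-level metadata stored in the corpus
head(docvars(corpus_lyrics))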
# Tokenize the corpus: split each document into tokens (typically words).
# Here we remove some elements that usually add noise for word-frequency analysis.
lyrics_toks <- tokens(
corpus_lyrics,
remove_numbers = TRUE, # remove tokens that are numbers (are these relevant?)
remove_punct = TRUE, # remove punctuation marks (e.g., commas, periods)
remove_url = TRUE # remove URLs (useful if lyrics contain links/metadata)
)
# Inspect a couple of tokenized documents (documents 4 and 14)
lyrics_toks[c(4, 14)]
## Tokens consisting of 2 documents and 4 docvars.
## text4 :
## [1] "Yeah" "The" "world" "is" "my" "expense" "It’s" "the"
## [9] "cost" "of" "my" "desire"
## [ ... and 227 more ]
##
## text14 :
## [1] "You're" "on" "the" "phone" "with" "your"
## [7] "girlfriend" "she's" "upset" "She's" "going" "off"
## [ ... and 385 more ]
We got rid of punctuation. Now let’s remove stop words, high- and low-frequency words, and stem the remaining tokens. Here I am cheating, though: I already know which words are high- and low-frequency because I inspected my dfm (see the next code chunk).
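If you are curious what the standard English stopword list actually contains before removing it, you can take a quick peek (purely illustrative):
# First 20 entries of the default English stopword list
head(stopwords(language = "en"), 20)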
# Remove stopwords and any additional terms you want to drop before building a dfm.
# - stopwords(language = "en") provides a standard English stopword list.
# - You can add/remove terms depending on your corpus and research question.
# - padding = FALSE drops removed tokens entirely (no placeholder tokens are kept).
lyrics_toks <- tokens_remove(
lyrics_toks,
c(
stopwords(language = "en"),
# "now" is very frequent in this corpus (identified after inspecting the dfm),
# and it is not substantively useful for our purposes here.
"now"
),
padding = FALSE
)
# Stem tokens to reduce inflected/derived words to a common root
# (e.g., "running", "runs" -> "run"), which reduces vocabulary size.
lyrics_toks_stem <- tokens_wordstem(lyrics_toks, language = "en")
# Compare the tokenized text before and after stemming for two example documents
lyrics_toks[c(4, 14)]
## Tokens consisting of 2 documents and 4 docvars.
## text4 :
## [1] "Yeah" "world" "expense" "It’s" "cost" "desire" "Jesus" "blessed"
## [9] "future" "protect" "fire" "raise"
## [ ... and 105 more ]
##
## text14 :
## [1] "phone" "girlfriend" "upset" "going" "something" "said"
## [7] "Cause" "get" "humor" "like" "room" "typical"
## [ ... and 133 more ]
lyrics_toks_stem[c(4, 14)]
## Tokens consisting of 2 documents and 4 docvars.
## text4 :
## [1] "Yeah" "world" "expens" "It’s" "cost" "desir" "Jesus" "bless"
## [9] "futur" "protect" "fire" "rais"
## [ ... and 105 more ]
##
## text14 :
## [1] "phone" "girlfriend" "upset" "go" "someth" "said"
## [7] "Caus" "get" "humor" "like" "room" "typic"
## [ ... and 133 more ]
We can compare the stemmed output and the non-stemmed output. Why did “future” become “futur”? Because stemming assumes that, for our purposes, words like “future” and “futuristic” should be treated as the same underlying root. Whether that assumption is appropriate depends on your research question.
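If you want to see exactly what the stemmer does to individual word types, you can apply it directly with quanteda's char_wordstem() (a small illustrative check; the words below are arbitrary examples):
# Stem a few hand-picked word types
char_wordstem(c("future", "futures", "futuristic"), language = "en")
Finally, we can create our document-feature matrix (dfm).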
# Create a document-feature matrix (dfm) from the tokens.
# Rows = documents; columns = features (typically word types); cells = feature counts.
lyrics_dfm <- dfm(lyrics_toks)
# Create a dfm from the stemmed tokens to further reduce vocabulary size.
lyrics_dfm_stem <- dfm(lyrics_toks_stem)
# Inspect the first few rows/columns of the stemmed dfm
head(lyrics_dfm_stem)
## Document-feature matrix of: 6 documents, 1,161 features (93.10% sparse) and 4 docvars.
## features
## docs come wit microphon explod shatter mold either drop hit like
## text1 4 4 1 1 1 1 1 3 1 1
## text2 2 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 4
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 1
## text6 0 4 0 0 0 0 0 0 0 0
## [ reached max_nfeat ... 1,151 more features ]
Note that once we create the dfm object, all tokens become lowercase (dfm() lowercases features by default). Now we can check the 30 most frequent tokens.
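For a quick look without plotting, quanteda's topfeatures() returns the same information as a named vector (shown here only as a sanity check):
# The 15 most frequent features in the stemmed dfm
topfeatures(lyrics_dfm_stem, 15)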
lyrics_dfm_stem %>%
# Compute the top n most frequent features (tokens) in the dfm
textstat_frequency(n = 30) %>%
# Plot the top features as a horizontal bar chart
ggplot(aes(
x = reorder(feature, frequency),
y = frequency,
fill = frequency,
color = frequency
)) +
# Use bars to show counts (alpha makes them slightly transparent)
geom_col(alpha = 0.5) +
# Flip coordinates so feature labels are easier to read
coord_flip() +
# scale_x_reordered() is designed for reorder_within(); with plain reorder()
# it only tidies the axis labels, so it is harmless here
scale_x_reordered() +
# Map frequency to color/fill gradients for visual emphasis
scale_color_distiller(palette = "PuOr") +
scale_fill_distiller(palette = "PuOr") +
# Clean theme
theme_minimal() +
labs(x = "", y = "Frequency", color = "", fill = "") +
# Hide legend (frequency is already shown on the y-axis)
theme(legend.position = "none")
This does not tell us much on its own, but I used the previous code to check for low-information tokens that I might want to remove from my analysis. We can also see how many tokens appear only once:
only_once <- lyrics_dfm_stem %>%
textstat_frequency() %>%
filter(frequency == 1)
## filter: removed 564 rows (49%), 597 rows remaining
length(only_once$feature)
## [1] 597
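If we wanted to drop these one-off features (or rare features more generally), dfm_trim() can do it. A minimal sketch, keeping only features that occur at least twice in the corpus:
# Keep features with a total corpus frequency of at least 2
lyrics_dfm_trimmed <- dfm_trim(lyrics_dfm_stem, min_termfreq = 2)
We keep working with the untrimmed dfm below; this is just a handy option when very rare features add noise.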
More interesting for text analysis is counting words over time or across space. In this case, our ‘space’ can be the artist.
lyrics_dfm_stem %>%
# Compute top features *within each artist* (grouped frequency table)
textstat_frequency(n = 15, groups = c(artist)) %>%
ggplot(aes(
x = reorder_within(feature, frequency, group), # reorder features separately within each facet
y = frequency,
fill = group,
color = group
)) +
geom_col(alpha = 0.5) +
coord_flip() +
# One panel per artist; free scales so each artist's frequency range can differ
facet_wrap(~group, scales = "free") +
# Fix axis ordering after reorder_within() + coord_flip()
scale_x_reordered() +
scale_color_brewer(palette = "PuOr") +
scale_fill_brewer(palette = "PuOr") +
theme_minimal() +
labs(x = "", y = "", color = "", fill = "") +
theme(legend.position = "none")
Interesting. There is not a lot of overlap (apart from one token shared by Megan Thee Stallion and Rage Against the Machine). However, it would be great if we could measure the importance of a word relative to how widely it appears across documents (i.e., normalize by document prevalence). Enter TF-IDF: “term frequency–inverse document frequency.” TF-IDF weighting up-weights relatively rare words–words that do not appear in many documents. By combining term frequency and inverse document frequency, we can identify words that are especially characteristic of a given document within a collection.
lyrics_dfm_tfidf <- dfm_tfidf(lyrics_dfm_stem) # Create a dfm with tf-idf instead of counts
lyrics_dfm_tfidf %>%
# force = TRUE is required to group a weighted (tf-idf) dfm; quanteda otherwise refuses because the cell values are no longer raw counts
textstat_frequency(n = 15, groups = c(artist), force = TRUE) %>%
ggplot(aes(x = reorder_within(feature, frequency, group), y = frequency, fill = group, color = group)) +
geom_col(alpha = 0.5) +
coord_flip() +
facet_wrap(~group, scales = "free") +
scale_x_reordered() +
scale_color_brewer(palette = "PuOr") +
scale_fill_brewer(palette = "PuOr") +
theme_minimal() +
labs(x = "", y = "TF-IDF", color = "", fill = "") +
theme(legend.position = "none")
If we are building a dictionary, for example, we might want to include words with high TF-IDF values. Another way to think about TF-IDF is in terms of predictive power. Words that appear in nearly every document have little predictive power and receive a TF-IDF value close to 0. Words that appear in only a relatively small number of documents tend to have greater predictive power and receive higher TF-IDF values. Be careful with extremely rare words, though: they can receive large weights while providing only idiosyncratic information about a single document (a strong “prediction” for that document, but little information about the rest), which is why they are often trimmed before analysis. As you will read in Chapters 6–7 of Grimmer et al., the goal is to find the right balance.
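To make the weighting concrete, here is a minimal sketch of the arithmetic dfm_tfidf() applies under its default settings (raw counts multiplied by the base-10 log of the inverse document frequency); the choice of feature is arbitrary and only for illustration:
# Pick the single most frequent stemmed feature, purely as an example
feat <- names(topfeatures(lyrics_dfm_stem, 1))
n_docs <- ndoc(lyrics_dfm_stem) # number of documents in the dfm
df_feat <- docfreq(lyrics_dfm_stem)[feat] # number of documents containing the feature
# Manual tf-idf: per-document count * log10(N / document frequency).
# A feature that appears in every document gets log10(1) = 0, i.e., no weight.
manual_tfidf <- as.matrix(lyrics_dfm_stem[, feat])[, 1] * log10(n_docs / df_feat)
# This should match the corresponding column of the weighted dfm
cbind(manual = manual_tfidf, quanteda = as.matrix(lyrics_dfm_tfidf[, feat])[, 1])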
Another useful tool (and concept) is keyness. Keyness is an association score, computed from a two-by-two contingency table, for features that occur differentially across two categories. We can estimate which features are more strongly associated with one category (in this case, one artist) relative to another. Let’s compare Megan Thee Stallion and Taylor Swift.
# Subset the dfm to documents released after 2006.
# In this corpus, that keeps only the Taylor Swift and Megan Thee Stallion songs,
# which is exactly the two-artist comparison we want.
lyrics_dfm_ts_mts <- dfm_subset(lyrics_dfm_stem, year > 2006)
# Compute keyness statistics (a differential association measure) for each feature.
# - target defines the "focus" group: here, documents where artist == "Taylor Swift".
# - The resulting object ranks features by how strongly they are associated with the target group
# versus the reference group (all other documents in the subset dfm, here: Megan Thee Stallion).
lyrics_key <- textstat_keyness(
lyrics_dfm_ts_mts,
target = lyrics_dfm_ts_mts$artist == "Taylor Swift"
)
# Visualize the most strongly associated (key) features for the target vs. the reference group.
textplot_keyness(lyrics_key)
This is similar to what we would have inferred from the TF-IDF graphs. Notice, though, that stemming does not always work as expected: Taylor Swift sings about “shake, shake, shake” and Megan Thee Stallion sings about “shaking,” yet these still appear as distinct features for the two artists.
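To see what happened, you can look up the relevant stems in the keyness table itself; textstat_keyness() returns a regular data frame with a feature column (what you find will depend on how the lyrics were transcribed):
# Which "shak-" features ended up in the dfm, and how are they scored?
lyrics_key %>%
filter(str_detect(feature, "^shak"))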
2.5 Word Frequency Across Artists
We can do something similar to what we did last week to look at word frequencies. Rather than creating a dfm, we can use the dataset as is and extract some basic information—for example, the average number of tokens by artist.
sample_lyrics %>%
# Tokenize the cleaned lyrics into one-token-per-row (similar in spirit to quanteda tokenization)
unnest_tokens(word, lyrics_clean) %>%
# Count tokens per song
group_by(song) %>%
mutate(total_tk_song = n()) %>%
# Keep one row per song (with its token count)
distinct(song, .keep_all = TRUE) %>%
# Compute the mean tokens per song within each artist
group_by(artist) %>%
mutate(mean_tokens = mean(total_tk_song)) %>%
# Plot token counts per song, faceted by artist
ggplot(aes(x = song, y = total_tk_song, fill = artist, color = artist)) +
geom_col(alpha = 0.8) +
# Add a dashed line for each artist's mean token count
geom_hline(aes(yintercept = mean_tokens, color = artist), linetype = "dashed") +
scale_color_manual(values = wes_palette("Royal2")) +
scale_fill_manual(values = wes_palette("Royal2")) +
facet_wrap(~artist, scales = "free_x", nrow = 1) +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 90, size = 5, vjust = 0.5, hjust = 1)
) +
labs(
x = "",
y = "Total Tokens",
color = "",
fill = "",
caption = "Note: Dashed line shows the mean token count by artist."
)
## group_by: one grouping variable (song)
## mutate (grouped): new variable 'total_tk_song' (integer) with 20 unique values and 0% NA
## distinct (grouped): removed 8,958 rows (>99%), 21 rows remaining (removed 0 groups, 21 groups remaining)
## group_by: one grouping variable (artist)
## mutate (grouped): new variable 'mean_tokens' (double) with 4 unique values and 0% NA

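If you only want those averages as numbers rather than a plot, a quick summary does the trick (a minimal sketch using the same tidytext tokenization):
sample_lyrics %>%
unnest_tokens(word, lyrics_clean) %>%
count(artist, song, name = "total_tk_song") %>% # tokens per song
group_by(artist) %>%
summarise(mean_tokens = mean(total_tk_song)) # mean tokens per song, by artist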
Alternatively, we can estimate the frequency of a specific token by song.
lyrics_totals <- sample_lyrics %>%
# take the column lyrics_clean and split it into one token (word) per row;
# unnest_tokens() uses a tokenizer broadly similar to quanteda's
unnest_tokens(word, lyrics_clean) %>%
group_by(song) %>%
mutate(total_tk_song = n()) %>%
distinct(song, .keep_all = TRUE)
## group_by: one grouping variable (song)
## mutate (grouped): new variable 'total_tk_song' (integer) with 20 unique values and 0% NA
## distinct (grouped): removed 8,958 rows (>99%), 21 rows remaining (removed 0 groups, 21 groups remaining)
# let's look for "like"
lyrics_like <- sample_lyrics %>%
# take the column lyrics_clean and divide it by words
# this uses a similar tokenizer to quanteda
unnest_tokens(word, lyrics_clean) %>%
filter(word=="like") %>%
group_by(song) %>%
mutate(total_like_song = n()) %>%
distinct(song,total_like_song) ## filter: removed 8,934 rows (99%), 45 rows remaining
## group_by: one grouping variable (song)
## mutate (grouped): new variable 'total_like_song' (integer) with 7 unique values and 0% NA
## distinct (grouped): removed 33 rows (73%), 12 rows remaining (removed 0 groups, 12 groups remaining)
We can now join these two data frames together with the left_join() function using the “song” column as the key. We can then pipe the joined data into a plot.
lyrics_totals %>%
left_join(lyrics_like, by = "song") %>%
ungroup() %>%
mutate(like_prop = total_like_song / total_tk_song) %>%
ggplot(aes(x = song, y = like_prop, fill = artist, color = artist)) +
geom_col(alpha = 0.8) +
scale_color_manual(values = wes_palette("Royal2")) +
scale_fill_manual(values = wes_palette("Royal2")) +
facet_wrap(~artist, scales = "free_x", nrow = 1) +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 90, size = 5, vjust = 0.5, hjust = 1)
) +
labs(x = "", y = "Prop. of 'Like'", color = "", fill = "")
## left_join: added one column (total_like_song)
## > rows only in x 9
## > rows only in lyrics_like ( 0)
## > matched rows 12
## > ====
## > rows total 21
## ungroup: no grouping variables remain
## mutate: new variable 'like_prop' (double) with 13 unique values and 43% NA

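One thing to note in the tidylog output above: songs in which “like” never appears come out of the left_join() with an NA, which is why like_prop is 43% NA and those songs show no bar. If we would rather treat them as zeros, we can replace the NAs before computing the proportion; a minimal sketch using tidyr’s replace_na():
lyrics_totals %>%
left_join(lyrics_like, by = "song") %>%
ungroup() %>%
mutate(
total_like_song = replace_na(total_like_song, 0), # songs with no "like" -> 0
like_prop = total_like_song / total_tk_song
)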
2.6 Final Words
As will often be the case, we won’t be able to cover every single feature that the different packages have to offer, show every object we create, or explore everything we can do with them. My advice is that you go home and explore the code in detail. Try applying it to a different corpus and come to the next class with questions (or just show off what you were able to do).