Poetry R

When Poetry Meets R

Poetry analysis can be a nightmare for many students. We often hear complaints like “poetry is dull and obscure and takes too long to analyze”. Luckily, with R, we can take a new approach to poetry analysis and make this process simple, colorful and interactive!

In this blogpost, we will directly compare some selected poems to get a sense of the poetic movement in the past centuries/decades (yes, without having to read all of them!). In particular, I will focus on feminist poetry written by women poets from different eras as an example to guide you through a series of questions.

How does poetry reflect the feminist movement?

Let’s start with a brief historical background of the feminist movement.

The first wave of feminism took place from the mid-19th to the early 20th century (1840s-1920s) and mostly centered on women’s right to vote. From the 1950s to the 1980s, the second wave of feminism mainly focused on challenging gender roles in society, including women’s job opportunities, educational equality, and reproductive rights. Finally, the third (1990s) and fourth (2012-2018) waves of feminism, often classified together, prioritized acceptance of female sexuality, body positivity, and sexual assault awareness (Source).

With this background information in mind, let’s see if we can identify some recurring themes underlying poems written during these waves.

Data Source and Preparation

Poetry Foundation once put together a collection called Poetry and Feminism to explore English poetry’s relationship to questions raised by feminist movements. We will scrape all the poems from this collection and use them in our analyses of 4 stages of the feminist movement: predecessors, first wave, second wave, and third/fourth waves.

In particular, I’ll use rvest for web scraping, tidyverse for data transformation, analysis, and visualization, and tidytext for text mining (some text processing procedures are inspired by a blogpost on WWI poetry analysis). Additionally, we will use tidygraph + visNetwork to make an interactive network graph and wordcloud to visualize keywords.

# load the necessary libraries
library(tidyverse) 
library(wordcloud)
library(rvest)
library(tidytext)
library(visNetwork)
library(tidygraph)
library(extrafont)
library(ggthemes)
library(RColorBrewer)
library(knitr)

Grabbing the URLs

First, we need to gather the URLs of all the poems featured in the collection. In this process, we also need to exclude pieces in irrelevant formats (e.g. prose, audio poems, and articles). In particular, we use read_html to read in the content of a page and html_nodes to extract the pieces we want from it.

# read in the content from the URL to the feminist poetry collection
feminism_url <- read_html("https://www.poetryfoundation.org/collections/146073/poetry-and-feminism")

# put URLs of all the pieces from the collection into a list
feminism_list <- feminism_url %>% 
  html_nodes("div.c-hdgSans.c-hdgSans_7 a") %>% 
                        html_attr("href")

# select URLs with poems only (exclude prose, audio poems, and articles)
feminism_poem_list <- feminism_list[grepl("/poems/", feminism_list)]

# read in the contents from all the feminist poems' URLs
feminism_poems <- lapply(feminism_poem_list, read_html)
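The selector-and-filter logic above can be tried offline on a tiny page. This is a minimal sketch: the HTML fragment and the example.org links are made up for illustration, and the real collection page's markup may differ.

```r
library(rvest)

# Hypothetical page fragment that imitates the link markup targeted above
toy_page <- read_html('
  <div class="c-hdgSans c-hdgSans_7"><a href="https://example.org/poems/1/a-poem">A Poem</a></div>
  <div class="c-hdgSans c-hdgSans_7"><a href="https://example.org/articles/2/an-essay">An Essay</a></div>
  <div class="c-hdgSans c-hdgSans_7"><a href="https://example.org/poems/3/another-poem">Another Poem</a></div>
')

# extract the href attribute of every matching link
toy_links <- toy_page %>%
  html_nodes("div.c-hdgSans.c-hdgSans_7 a") %>%
  html_attr("href")

# keep only links whose path contains "/poems/"
toy_poem_links <- toy_links[grepl("/poems/", toy_links)]
toy_poem_links
```

Filtering on the URL path is a simple heuristic: it works as long as the site keeps poems under a "/poems/" path.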

Creating a tidy data frame

Next, we will define a series of functions to get text, title, and author from the URLs. These functions will make it easier to build the data frame we need later.

# define helper functions that pull the text, title, and author out of a parsed poem page
get_text <- function(page) {
  text <- page %>% 
    html_nodes("div.c-feature-bd div") %>% 
    html_text()
  # keep only the pieces without embedded newlines and join them into one string
  paste(text[!grepl("\n", text)], collapse = " ")
}

get_title <- function(page) {
  page %>% 
    html_nodes('div.c-feature-hd h1') %>% 
    html_text()
}

get_author <- function(page) {
  author <- page %>% 
    html_nodes("div span") %>% 
    html_text()
  # the first span containing "By " holds the byline
  author <- author[grepl("By ", author)][1]
  author <- gsub("\n ", "", author)
  trimws(gsub("By ", "", author))
}
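To make the byline cleanup in get_author easier to follow, here is a small base-R sketch of the same three steps on a made-up raw string. The helper name clean_author and the sample string are hypothetical and only serve the illustration.

```r
# Hypothetical helper mirroring the cleanup steps used in get_author()
clean_author <- function(raw) {
  cleaned <- gsub("\n ", "", raw)      # drop each newline plus the space that follows it
  cleaned <- gsub("By ", "", cleaned)  # strip the "By " prefix
  trimws(cleaned)                      # remove leftover leading/trailing whitespace
}

# Made-up raw string imitating scraped byline text
raw_author <- "\n                    By Anne Bradstreet\n                "
author_name <- clean_author(raw_author)
author_name
```

Note that gsub removes every occurrence of "By ", so a name that itself contains that sequence would be altered; anchoring the pattern (for example "^\\s*By ") would be a stricter alternative.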

Here, we finally get to create a large data frame with text, title, author, and wave (which feminism wave the poem belongs to), and each row represents a single poem. This data frame has 89 usable poems for analysis (after removing poems in unreadable formats).

# create a data frame
feminism_df <- data.frame(text = as.character(lapply(feminism_poems, get_text)),
                      title = as.character(lapply(feminism_poems, get_title)),
                      author = as.character(lapply(feminism_poems, get_author)))

# remove poems without readable texts
feminism_df <- feminism_df %>%
  filter(text != "")

# add different waves to feminism_df
feminism_df$wave <- NA
feminism_df$wave[1:14] <- "Predecessors"
feminism_df$wave[15:30] <- "1st Wave"
feminism_df$wave[31:69] <- "2nd Wave"
feminism_df$wave[70:89] <- "3rd & 4th Wave"

feminism_df <- feminism_df %>%
  mutate(text = as.character(text), author = as.character(author), title = as.character(title))

# a glimpse of the data frame
glimpse(feminism_df)
## Observations: 90
## Variables: 4
## $ text   <chr> "To sing of Wars, of Captains, and of Kings,  Of Cities founde…
## $ title  <chr> "\n                    Prologue\n                ", "\n       …
## $ author <chr> "Anne Bradstreet", "Anne Bradstreet", "Countess of Winchilsea …
## $ wave   <chr> "Predecessors", "Predecessors", "Predecessors", "Predecessors"…
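The wave labels above are assigned by row position. As an alternative sketch, the same boundaries can be written once as break points and passed to cut(); the break values below are copied from the index ranges above, and the variable names are only illustrative.

```r
# Row-index boundaries copied from the assignments above
wave_breaks <- c(0, 14, 30, 69, 89)
wave_labels <- c("Predecessors", "1st Wave", "2nd Wave", "3rd & 4th Wave")

# cut() maps each row position to the wave whose (lower, upper] interval contains it
wave <- cut(seq_len(89), breaks = wave_breaks, labels = wave_labels)

# number of poems per wave
table(wave)
```

The result is a factor whose levels already follow chronological order. Either way, the labels depend on the scraped rows keeping the collection's original order.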

What’s universal? What’s unique?

Now we have all the textual data ready to be analyzed! First, let’s focus on individual words and see which words are common across all the feminist waves.

In the first step of the analysis, we will tokenize the text into individual words using unnest_tokens and remove stop words such as “the” and “a” by anti_join-ing our data frame with stop_words. We will also need to manually remove some irrelevant words (e.g. “it’s”, “don’t”, “thou”), numbers, and repetitive names.

# create a feminism dataframe with tokenized words (excluding stop_words)
feminism_words_no_stopword <- feminism_df %>%
  unnest_tokens(output = word, input = text, token = "words") %>%
  anti_join(stop_words)

# write a negation function for later use
`%nin%` <- Negate(`%in%`)

# remove names, stop words that the `stop_words` df misses, and words that contain digits
feminism_words_no_stopword <- feminism_words_no_stopword %>%
  filter(word %nin% c("laura", "lizzie", "thy", "thou", "emily’s",
                      "emily","charlotte","don’t", "it’s")) %>%
  filter(!str_detect(word, "[:digit:]"))
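The tokenize-then-filter step can be seen on a single made-up sentence. This is a minimal sketch assuming dplyr and tidytext are installed; the toy text is invented and not taken from the collection.

```r
library(dplyr)
library(tidytext)

# Made-up two-sentence "poem" used only for illustration
toy_df <- tibble(title = "Toy poem",
                 text  = "The woman walks into the night. The night is hers.")

toy_words <- toy_df %>%
  # one row per word; unnest_tokens lowercases and strips punctuation by default
  unnest_tokens(output = word, input = text, token = "words") %>%
  # drop every word that appears in the stop_words lexicon (e.g. "the")
  anti_join(stop_words, by = "word")

# frequency of the remaining words
toy_words %>% count(word, sort = TRUE)
```

Passing by = "word" explicitly is optional, but it silences the message that anti_join prints when it has to guess the join column.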

Creating a wordcloud

Now, we can arrange words by frequency and use wordcloud to visualize the words that appear across all waves (we will only include words with a frequency of at least 5 here) to find out which themes are universal.

# arrange words by frequency
# (stored as word_freq so the name does not shadow the wordcloud() function)
word_freq <- feminism_words_no_stopword %>%
  group_by(word) %>%
  summarize(freq = n()) %>%
  arrange(desc(freq)) %>%
  as.data.frame() 

# choose a palette
pal <- brewer.pal(6, "Spectral")

# create a wordcloud
word_freq %>%
  with(wordcloud(word, freq, colors = pal,
       min.freq = 5, random.order = FALSE))

From this wordcloud, we can easily spot some general themes shared by all feminist waves such as “love”, “night”, and “life”.

Also, the most common words seem to fall into similar categories: body-related (“eyes”, “hands”, “heart”, “head”, “body”), race-related (“white”, “black”), gender-role-related (“woman”, “father”, “mother”), and nature-related (“water”, “sun”, “air”, “fire”, “moon”, “sky”).

Visualizing word patterns in separate waves

However, as noted in the background section, each wave also has a distinct focus. In the following analysis, we will use a bar chart to compare words across waves and identify the words unique to each movement.

Here, we will first use group_by to get the most common words by wave, then use pivot_wider to transform the data into a wide format with wave as columns, and finally calculate a total count for each word across waves.

# group by word and wave
word_count <- feminism_words_no_stopword %>%
  group_by(word, wave) %>%
  summarize(count = n()) 

# widen the dataframe (turn different waves into individual columns)
word_count <- word_count %>% 
  pivot_wider(names_from = wave, 
              values_from = count,
              values_fill = list(count = 0))

# create another column containing the total count of each word and only keep words with total >= 15
word_count$total <- rowSums(word_count[,2:5]) 

word_count <- word_count %>%
  filter(total >=15)

# transform the data back to the long format to prepare the graph
word_count <- word_count %>% 
  pivot_longer(cols= -word, names_to = "wave")

# store the waves as factors
word_count$wave <- factor(word_count$wave,
                          levels = c("Predecessors", "1st Wave", 
                                     "2nd Wave", "3rd & 4th Wave"))

# drop low per-wave counts and the helper "total" rows, then make a bar chart
word_count %>% 
  filter(value > 2,
         wave != "total") %>%
  ggplot(aes(x = word, y = value, group = wave, fill = wave)) +
  facet_grid(cols = vars(wave)) + 
  theme_minimal() + 
  geom_col() + 
  coord_flip() +
  scale_fill_manual(values = c('mistyrose2','indianred2','indianred3','indianred4')) +
  labs(title = 'Most common words across feminism waves', y = 'Count of words',x = '') +
  theme_fivethirtyeight() 
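The widen, total, and lengthen round trip used above is easier to follow on a handful of rows. This is a minimal sketch with made-up tokens, assuming dplyr and tidyr are installed; the toy_ names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up tokens with wave labels, not taken from the poems
toy_tokens <- tibble(
  word = c("love", "love", "body", "love", "body", "moon"),
  wave = c("1st Wave", "2nd Wave", "2nd Wave", "2nd Wave", "2nd Wave", "1st Wave")
)

# count per word and wave, then spread the waves into columns (missing pairs become 0)
toy_wide <- toy_tokens %>%
  count(word, wave, name = "count") %>%
  pivot_wider(names_from = wave, values_from = count,
              values_fill = list(count = 0))

# total across all wave columns (every column except `word`)
toy_wide$total <- rowSums(toy_wide[, setdiff(names(toy_wide), "word")])

# back to long format; "total" now appears as one more value in the `wave` column
toy_long <- toy_wide %>%
  pivot_longer(cols = -word, names_to = "wave")
toy_long
```

Because the total column is pivoted along with the wave columns, the long table contains "total" rows, which is why they have to be filtered out again before plotting.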

Comparing across waves, the plot indicates that the 2nd wave shares the most words with the 3rd/4th waves (however, it’s important to note that we have fewer poems from the predecessors and the 1st wave in our dataset).

In particular, there is a surge in the word “black” in the 2nd, 3rd & 4th waves, which might reflect an increase in the involvement of minority groups in the feminist movement after the first wave (which mainly consisted of white advocates). Also, “body”, “nude”, and “blood” appear more frequently in the last two waves - this trend is consistent with the rising attention to body positivity and expression of sexuality in the recent feminist waves.

Bigram Visualization: The “Woman-Network”

Another common text processing technique is to tokenize text into pairs of consecutive words called bigrams, which is useful for analyzing the relationship between words.

Inspired by a blogpost on lyrics analysis, in this step, we will use as_tbl_graph and visIgraph together to create a network graph of words appearing before or after the keyword “woman” to explore how the word “woman” is used in the poems.

In particular, we will make words that appear after “woman” have a pink line, and words that appear before “woman” have a black line. Also, more common combinations will have thicker lines and sit closer to the center.

# tokenize text into bigrams
feminism_bigram <- feminism_df %>%
  unnest_tokens(bigram, text, 
                token = "ngrams", n = 2) 

# separate bigrams into word 1 and word 2 (after removing non-words, i.e. numbers)
bigram_separated <- feminism_bigram %>% 
  filter(!str_detect(bigram, "[:digit:]")) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# only keep word pairs that contain the word "woman" 
bigram_separated_woman <- 
  bigram_separated %>% 
  filter(word1 == "woman" | word2 == "woman") %>% 
  count(word1, word2, sort = TRUE) %>%
  as.data.frame()

# prepare for a network graph
woman_graph_data <- 
     bigram_separated_woman %>%
     as_tbl_graph() %>% 
       mutate(color.background = if_else(name == "woman", "hotpink", "black"),
              color.border = if_else(name == "woman", "hotpink", "black"),
              label = name,
              labelHighlightBold = TRUE,
              size = if_else(name == "woman", 70, 25),
              font.face = "Courier",
              font.size = if_else(name == "woman", 70, 40),
              font.color = if_else(name == "woman", "hotpink", "black"),
              shape = "star") %>% 
      activate(edges) %>% 
       mutate(hoverWidth = n,
              selectionWidth = n,
              scaling.max = 30)

# make an interactive graph
visIgraph(woman_graph_data) %>%
  visOptions(highlightNearest = list(enabled = TRUE, hover = TRUE), 
             nodesIdSelection = TRUE)
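The bigram steps behind the graph can be checked on one made-up sentence before running them on all the poems. This is a minimal sketch assuming dplyr, tidyr, and tidytext are installed; the sentence is invented for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up sentence, not taken from the collection
toy_text <- tibble(text = "A young woman should rise")

toy_bigrams <- toy_text %>%
  # sliding window of two consecutive words: "a young", "young woman", ...
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # keep pairs where "woman" is either the first or the second word
  filter(word1 == "woman" | word2 == "woman") %>%
  count(word1, word2, sort = TRUE)

toy_bigrams
```

A row with word2 == "woman" holds a word used before “woman” (here “young”), while a row with word1 == "woman" holds a word used after it (here “should”). When the table is turned into a graph, word1 and word2 become the start and end of each edge.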

From this network graph, we can observe that adjectives used to describe “woman” are inclusive of women from different groups (“young”, “old”, “white”, and “black”). Some words used before “woman” have negative connotations (“slave”, “lonely”, and “injured”), while others have empowering and positive connotations (“tall”, “loyal”, “noblest”).

The verbs used after “woman” mostly imply an urgency for action (“should”, “would”), rebellion (“rise”, “screamed”, “stands”), and suffering (“grimaces”, “falls”, “cannot”).

Readability and sentiment

Some poems seem more obscure and difficult to understand than others. The difficulty of reading a passage, or readability, is often measured in computational linguistics with the Flesch Reading Ease score. The score is calculated from the average sentence length and the average number of syllables per word, and a higher score indicates text that is easier to read.
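For intuition, the standard Flesch Reading Ease formula can be written as a small base-R function. This is only a sketch: the function name and the word, sentence, and syllable counts are made up, and a library implementation has to estimate those counts from the raw text, so its scores will not match a hand calculation exactly.

```r
# Flesch Reading Ease:
# 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Hypothetical passage: 100 words in 5 sentences with 130 syllables
prose_score <- flesch_reading_ease(100, 5, 130)

# Hypothetical poem with no sentence-ending punctuation:
# 300 words counted as 1 sentence, 400 syllables
poem_score <- flesch_reading_ease(300, 1, 400)

c(prose = prose_score, poem = poem_score)
```

The second case hints at one plausible reason (an assumption, not something verified against these poems) why poetry can receive negative scores: a poem with little punctuation may be treated as a single very long sentence, which makes the sentence-length penalty very large.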

In the example below, we will investigate individual poems with varying degrees of readability. In particular, from our data frame, we will look at the most readable poem, the least readable poem, and an additional poem with a readability score in between. We will use textstat_readability from the quanteda library (in quanteda 3.0 and later, the function lives in the companion package quanteda.textstats) to calculate a readability score (Flesch’s measure) for every poem and make our selection.

library(quanteda)
# create a new data frame to compute a readability score for each poem
readability <- textstat_readability(as.character(feminism_df$text))

# create a column in the original `feminism_df` to match each poem with its readability score
feminism_df <- feminism_df %>%
  mutate(readability_score = readability$Flesch)

# choose the most & the least readable poem & a poem in between
selected_readability <- feminism_df %>%
  arrange(desc(readability_score)) %>%
  filter(row_number() %in% c(1, 37, 89)) %>%
  mutate(length = str_length(text)) %>%
  select(title, wave, readability_score, length)

# show the three poems selected for analysis
kable(selected_readability, digits = 2)
| title        | wave     | readability_score | length |
|:-------------|:---------|------------------:|-------:|
| The friend   | 2nd Wave |            102.28 |    396 |
| San Sepolcro | 2nd Wave |             76.34 |   1211 |
| Planetarium  | 2nd Wave |           -159.75 |   1603 |

After processing, we obtained a poem with the highest readability score, “The Friend” (102.28), one with the lowest readability score, “Makeup on Empty Space” (-1007.16), and one with a readability score in between, “San Sepolcro” (76.34). These poems were all written during the 2nd wave.

Is there a relationship between readability and emotional complexity? To analyze sentiment of an individual poem, we can use the nrc lexicon to assign a sentiment (one of the basic emotions) to each word in the text.

# get the nrc data frame
nrc <- get_sentiments("nrc")

# put the three poems into a data frame
compare <- feminism_df %>%
  arrange(desc(readability_score)) %>%
  filter(row_number() %in% c(1, 37 ,89))

# joining the data frame and nrc
compare <- compare %>%
  unnest_tokens(output = word, input = text,
                token = "words") %>%
  inner_join(nrc, by = "word") 

# make a bar graph for the number of different sentiments
compare %>%
  count(title, sentiment) %>%
  filter(sentiment %nin% c("positive", "negative")) %>%
  ggplot(aes(y = n, x  = sentiment)) +
  geom_col(aes(fill = sentiment)) +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6),
        legend.position = "bottom") +
  scale_fill_brewer(palette = "Pastel1") +
  facet_wrap(~title) +
  labs(title = "Sentiment analysis of poems",
       subtitle = "from left to right: low, middle, and high readability")
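The join-and-count logic above can be tried without downloading the nrc lexicon. In this minimal sketch the four-word lexicon and the two toy "poems" are hypothetical stand-ins; only the shape (a word column and a sentiment column) imitates what get_sentiments("nrc") returns.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon standing in for nrc (one sentiment per word here)
toy_lexicon <- tibble(word      = c("love", "fear", "dark", "war"),
                      sentiment = c("joy", "fear", "sadness", "negative"))

# Made-up texts, not taken from the collection
toy_poems <- tibble(title = c("Poem A", "Poem B"),
                    text  = c("love love fear war", "dark love"))

toy_sentiments <- toy_poems %>%
  unnest_tokens(output = word, input = text, token = "words") %>%
  # words missing from the lexicon are dropped by the inner join
  inner_join(toy_lexicon, by = "word") %>%
  # keep the basic emotions, drop the positive/negative polarity labels
  filter(!sentiment %in% c("positive", "negative")) %>%
  count(title, sentiment)

toy_sentiments
```

Since only words found in the lexicon survive the join, longer poems tend to contribute more matches, which is worth keeping in mind when comparing bar heights across poems.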

With this side-by-side sentiment comparison, it’s easy to see that the most readable poem (“The Friend”) has the fewest types of sentiments, the poem in between (“San Sepolcro”) has slightly more types of sentiments, and the least readable poem (“Makeup on Empty Space”) has the most diverse range of emotions.

It’s possible that this increasing complexity in sentiment is related to the complexity of reading and processing. However, looking at the total count of emotions for each poem, we should also note that readability seems to be associated with the length of the text: the longer poems (those with taller bars) tend to score as less readable.

Conclusion

To sum up, we can use different visualizations (e.g. barchart, network graph, wordclouds, etc.) to investigate common/unique themes, relationships between words, and sentiment of selected individual poems. I hope some processing steps in this blogpost can inspire you to analyze poetry differently!