# The Claire Network

### Analyzing the relationship between life events and Facebook post sentiments

I’ll admit it: I used to post a LOT on Facebook. Eight thousand three hundred and sixty-four posts from 2012 through 2019, to be exact. While there are many theories about why I was an oversharer (and proof I’m not alone!), I’m going to focus here on using this wealth of data to see whether I can find a relationship between personal life events and the sentiment of the posts that followed. In line with my reformed ways, I’m not revealing what most of these life events are, so please feel free to use your imagination.

### Data wrangling and sentiment analysis

Facebook allows you to download your data as a JSON file. For this project, I asked a friend to take the posts I made on my own timeline and turn them into a csv file called my_posts, with each row being one post. Let’s load our required libraries, then do some data wrangling. I’ll include all the steps that might be useful in case you want to try this with your own data.

library(stringr)
library(lubridate)
library(tidytext)
library(sentimentr)
library(tm)
library(infer)
library(tidyverse)
library(viridis)
#broom provides tidy(), which we use later on the ANOVA output
library(broom)

my_posts <- my_posts %>%
  #fixing apostrophes (str_replace_all fixes every occurrence in a post, not just the first)
  mutate(text = str_replace_all(text, "â", "'")) %>%
  #converting the Unix timestamp to a useful datetime variable
  mutate(datetime = as_datetime(timestamp),
         #adding date variables, for playing around with different timeframes
         date = date(datetime),
         year = year(date),
         month = month(date)) %>%
  #selecting only the years where I have full data
  filter(year %in% seq(2012, 2019)) %>%
  mutate(ID = row_number())
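If you try this with your own export, it's worth checking what the timestamp conversion does to dates. Here's a minimal sketch with a made-up timestamp; the time zone is only an example, and I'm assuming the export stores seconds since the Unix epoch.

```r
library(lubridate)

# A made-up Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
ts <- 1440460800

# as_datetime() interprets the number as UTC unless told otherwise
utc_time <- as_datetime(ts)

# The same instant viewed in another zone (example zone only)
local_time <- with_tz(utc_time, tzone = "America/Los_Angeles")

# The calendar date can differ between the two views
as_date(utc_time)
as_date(local_time)
```

A post made late in the evening local time can land on the next day in UTC, which matters once we start counting days before and after an event.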

Now that we have a clean dataset, we can find the sentence-level text sentiment using sentimentr. However, some of my posts are more than one sentence long. What we have to do is unnest the posts into sentences, find the sentence-level sentiment, then re-combine the sentences into their posts and use the mean sentiment for that post.
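As a small illustration of that unnest, score, recombine idea, here's a sketch on two invented posts. The text and the `toy_` names are made up for the example, and it assumes `sentimentr` and `dplyr` are installed.

```r
library(sentimentr)
library(dplyr)

# Two invented posts; the first one has two sentences
toy_posts <- c("I love my dog. The weather is awful today.",
               "Best day ever!")

# One row per sentence; element_id is the index of the original post
toy_sentences <- sentiment(get_sentences(toy_posts))

# Average the sentence scores within each post
toy_by_post <- toy_sentences %>%
  as.data.frame() %>%
  group_by(element_id) %>%
  summarize(post_sentiment = mean(sentiment))

toy_by_post
```

`sentimentr` also offers `sentiment_by()`, which returns one score per post directly. As far as I know its default averaging downweights zero-scored sentences rather than taking a plain mean, so its numbers may not match the approach used here.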

sentences <- my_posts %>%
  unnest_tokens(output = sentence, input = text, token = "sentences") %>%
  get_sentences() %>%
  sentiment()

Now we have a dataset that looks like this…

… with the key new variable being sentiment: negative scores for negative sentiments, and positive scores for positive sentiments. But wait, what’s up with the sentences with a sentiment of exactly 0?

sentences %>%
  filter(sentiment == 0) %>%
  select(sentence, sentiment) %>%
  glimpse()

(Salmi is my dog.)

I wouldn’t classify the posts I’ve found here as perfectly neutral, but no package is perfect. I went ahead and did the work of comparing our later analyses with and without these posts with a sentiment of zero, and the overall relationships didn’t change much. We’ll just remove them in the next step.

Now we’ll go ahead and recombine these into posts, and find the mean sentiment of each post by finding the mean of its component sentence sentiments.

by_post <- sentences %>%
  group_by(year, month, date, timestamp, ID) %>%
  summarize(post_sentiment = mean(sentiment)) %>%
  filter(post_sentiment != 0)

#let's recombine it with the my_posts dataset to get the text information back

by_post <- inner_join(my_posts, by_post, by = c("ID", "year", "month", "timestamp", "date"))

We’ll set this aside for now.

Now let’s create a dataset of what I think are my most significant life events between 2012 and 2019.

life_events <- data.frame(
  text = c("P1", "P2", "P3", "N1", "N2"),
  date = c("2014-06-15", "2015-08-25", "2016-03-02", "2016-08-25", "2017-10-07"),
  year = c(2014, 2015, 2016, 2016, 2017),
  month = c(6, 8, 3, 8, 10),
  valence = c("Positive", "Positive", "Positive", "Negative", "Negative"))

life_events$date <- as_date(life_events$date)

### Combining life events and posts

Let’s see if we can spot any trends with the naked eye.

ggplot(by_post, aes(x = date, y = post_sentiment)) +
  geom_point() +
  geom_vline(life_events, mapping = aes(xintercept = date, color = valence)) +
  ...

Let’s instead try taking advantage of the lubridate package to group posts and life events by the nearest month, then we’ll try this again.

months <- by_post %>%
  mutate(nearest_month = round_date(datetime, unit = "month"),
         nearest_month = as_date(nearest_month)) %>%
  group_by(nearest_month, year) %>%
  summarize(mean_month_sent = mean(post_sentiment))
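One thing to keep in mind with `round_date()`: it snaps to the nearest month boundary, so a late-month date moves forward to the first of the following month. Here's a quick sketch using three of the event dates, with `floor_date()` shown for comparison.

```r
library(lubridate)

event_dates <- as_date(c("2014-06-15", "2015-08-25", "2016-03-02"))

# round_date() picks the closest month start, which can be the NEXT month
rounded <- round_date(event_dates, unit = "month")

# floor_date() always goes back to the first of the same month
floored <- floor_date(event_dates, unit = "month")

data.frame(event_dates, rounded, floored)
```

So the August 25 event is grouped with September here, while the June 15 and March 2 events stay in their own months.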

life_events <- life_events %>%
  mutate(nearest_month = round_date(date, unit = "month"))
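The plotting code that follows refers to `baseline$mean`, which isn't defined in the snippets shown here. Going by the description of the purple line (drop every post within 45 days of an event, then take the mean), one way it might be built is sketched below on toy data. `toy_posts`, `toy_events` and `near_event` are hypothetical names, and the original calculation may have differed in detail.

```r
library(dplyr)

# Invented stand-ins for by_post and life_events
toy_posts <- tibble(
  date = as.Date("2016-01-01") + seq(0, 300, by = 10),
  post_sentiment = rep(c(0.1, 0.3, -0.2), length.out = 31)
)
toy_events <- tibble(date = as.Date(c("2016-03-02", "2016-08-25")))

# Work in whole days so the comparison is plain arithmetic
post_days <- as.numeric(toy_posts$date)
event_days <- as.numeric(toy_events$date)

# TRUE for posts within 45 days (either side) of any event
near_event <- vapply(post_days,
                     function(d) any(abs(d - event_days) <= 45),
                     logical(1))

# Mean sentiment of everything that is NOT near an event
baseline <- toy_posts %>%
  filter(!near_event) %>%
  summarize(mean = mean(post_sentiment))

baseline$mean
```

With the real data, `by_post` and `life_events` would take the place of the toy tables.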

ggplot(months, mapping = aes(x = nearest_month, y = mean_month_sent)) +
  geom_point() +
  geom_vline(life_events, mapping = aes(xintercept = date, color = valence)) +
  geom_abline(slope = 0, intercept = baseline$mean, color = "purple") +
  #adding in a polynomial line of best fit, which includes standard error
  geom_smooth() +
  geom_abline(slope = 0, intercept = 0) +
  ...

Much better! We can see there’s a pretty clear pattern here. Overall, my posts are generally positive. There’s a steady increase in post sentiment as we move through positive life events 1-3, which begins to flatten after negative life event 1 and decreases after negative life event 2. The purple line is my baseline post sentiment, which I got by removing all of the posts within 45 days of each event and then taking the mean.

### Let’s get quantitative!

Now let’s take a quantitative approach to our main question: do my post sentiments change after life events? And if so, how long does the effect last?

We’re going to use one-way ANOVA tests on paired time windows for this. That is, we’ll compare the mean sentiment of posts one day after the event to the mean sentiment of posts one day before the event, and ask if there’s a difference. We’ll repeat this for two days out, three days out, and so on up to 45 days. We’ll also do this for each life event.

I’ll go ahead and spoil it: there’s just random variation for positive life event 1. We can say then that positive life event 1 did not have a detectable effect on the post sentiments that followed, compared to post sentiments from the same time period before the event. However, it’s important to note that the date of this event was an estimate.

But how did I find this? Let’s get started with positive life event 2, where we’ll see some interesting results. I don’t want to find the ANOVA test statistic and p-value for each time period (one day, two days…) individually, so I’m going to write a for loop. First though, I’ll walk you through the guts of the for loop by running the code for just one day.
#grabbing just P2
p2 <- life_events %>%
  arrange(date) %>%
  slice(2)

p2 <- p2 %>%
  #creating intervals of after and before the event
  mutate(interval_after = interval(date + ddays(1), date + ddays(1)),
         interval_before = interval(date - ddays(1), date - ddays(1)),
         interval_before = int_flip(interval_before))

P2, now with intervals for the time period we’re interested in.

#taking our dataset with the mean post sentiments
after <- by_post %>%
  #finding which posts fall within our interval of interest, using the interval we just made
  filter(date %within% p2$interval_after) %>%
  #adding a new variable, which will be our explanatory variable for ANOVA
  mutate(timeline = "After")

before <- by_post %>%
  filter(date %within% p2$interval_before) %>%
  mutate(timeline = "Before")

#combining the datasets
combined <- rbind(before, after)

mod <- aov(post_sentiment~timeline, combined)
mod <- tidy(mod)

ANOVA for life event P2, 1 day before/after.

Great, now let’s use a for loop to repeat this for days 1-45. We’ll also store the test statistic and p-value for each time period, as well as the mean post sentiment before and after and their difference. (Remember, the ANOVA test stat doesn’t tell you which mean is greater, just the magnitude of the difference.)

#creating a blank dataframe to store for loop in
b <- 45

P2_aov_results <- data.frame(test_stat = rep(NA, b),
                             p_value = rep(NA, b),
                             day = (1:45),
                             mean_before = rep(NA, b),
                             mean_after = rep(NA, b))

#here, 'i' is whatever number the for loop is on
#for the first loop, R says i = 1, then it repeats with i = 2, ... through i = 45
for(i in 1:45){

  p2 <- p2 %>%
    mutate(interval_after = interval(date + ddays(1), date + ddays(i)),
           interval_before = interval(date - ddays(1), date - ddays(i)),
           interval_before = int_flip(interval_before))

  after <- by_post %>%
    filter(date %within% p2$interval_after) %>%
    mutate(timeline = "After")

  before <- by_post %>%
    filter(date %within% p2$interval_before) %>%
    mutate(timeline = "Before")

  #finding and storing the means for each time period i
  P2_aov_results$mean_after[i] <- after %>%
    summarize(mean = mean(post_sentiment)) %>%
    pull()

  P2_aov_results$mean_before[i] <- before %>%
    summarize(mean = mean(post_sentiment)) %>%
    pull()

  combined <- rbind(before, after)

  mod <- aov(post_sentiment~timeline, combined)
  mod <- tidy(mod)

  #grabbing out only the test statistic and p_value from mod
  #and storing them in the blank dataframe we made before
  P2_aov_results$test_stat[i] <- as.numeric(mod[1,5])
  P2_aov_results$p_value[i] <- as.numeric(mod[1,6])
}

#finding the differences
#a positive difference means the event had a positive effect
P2_aov_results <- P2_aov_results %>%
  mutate(difference = mean_after - mean_before)

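The loop pulls the test statistic and p-value out of the tidied ANOVA table by position (`mod[1,5]` and `mod[1,6]`). As a sanity check on what those cells are, this sketch runs the same `aov()` plus `tidy()` steps on invented before/after scores and pulls the same values by name. `toy` and its numbers are made up, and it assumes `broom` is installed.

```r
library(broom)

# Invented sentiment scores for a before/after comparison
toy <- data.frame(
  post_sentiment = c(0.10, 0.05, 0.20, 0.15, 0.40, 0.35, 0.50, 0.30),
  timeline = rep(c("Before", "After"), each = 4)
)

toy_mod <- tidy(aov(post_sentiment ~ timeline, data = toy))

# Columns 5 and 6 of the tidied table are the F statistic and the p-value
names(toy_mod)

# Row 1 is the 'timeline' term; row 2 holds the residuals
f_stat <- toy_mod$statistic[1]
p_val <- toy_mod$p.value[1]
```

Indexing by name is a little more robust than indexing by position if the table layout ever changes.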
Now we get a dataframe that looks like this. This tells us the direction of the difference of means as well as their magnitude.

### Results

Let’s visualize this! I’m deliberately choosing to keep the p-value on a continuous scale, rather than succumb to the “only p < 0.05 matters” myth.

ggplot(P2_aov_results, aes(x = day, y = test_stat, fill = p_value)) +
  geom_col() +
  scale_fill_viridis_c(option = "B", limits = c(0, 1)) +
  ...

For positive event 2, which is the first day of my freshman year at Reed, we see a large and long-lasting effect. If we look at the difference for my mean post sentiments at the interval when the effect was greatest, which was 11 days, we can see it’s a positive change.

P2_aov_results %>%
  #sorting by test stat, in descending order
  arrange(desc(test_stat)) %>%
  #taking the top row
  slice(1)

Let’s repeat this analysis for all the other life events. The code is the same for loop again, just with a different row from the life_events dataset.
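Since only the event row changes between runs, the loop can be wrapped in a function. The version below is a sketch, not the code used for the figures: `run_event_aov()` and its arguments are hypothetical names, it uses plain date arithmetic in place of lubridate intervals, and it's demonstrated on randomly generated posts.

```r
library(dplyr)
library(broom)

# Hypothetical helper: compares posts 1..i days after an event with
# posts 1..i days before it, for i = 1..max_days
run_event_aov <- function(posts, event_date, max_days = 45) {
  # signed number of days between each post and the event
  posts <- posts %>%
    mutate(offset = as.numeric(date - event_date))
  results <- vector("list", max_days)
  for (i in seq_len(max_days)) {
    windowed <- posts %>%
      filter(abs(offset) >= 1, abs(offset) <= i) %>%
      mutate(timeline = if_else(offset > 0, "After", "Before"))
    mod <- tidy(aov(post_sentiment ~ timeline, data = windowed))
    is_after <- windowed$timeline == "After"
    results[[i]] <- tibble(
      day = i,
      test_stat = mod$statistic[1],
      p_value = mod$p.value[1],
      mean_before = mean(windowed$post_sentiment[!is_after]),
      mean_after = mean(windowed$post_sentiment[is_after])
    )
  }
  # a positive difference means posts were more positive after the event
  bind_rows(results) %>%
    mutate(difference = mean_after - mean_before)
}

# Invented data: two posts per day for 60 days either side of a made-up event
set.seed(1)
toy_event <- as.Date("2015-08-25")
toy_posts <- tibble(
  date = rep(toy_event + (-60:60), each = 2),
  post_sentiment = rnorm(242, mean = 0.1, sd = 0.3)
)

toy_results <- run_event_aov(toy_posts, toy_event, max_days = 45)
head(toy_results)
```

Real exports have days with no posts at all, so a version meant for actual data would need to handle windows where one side is empty.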

It looks like negative event 1, which was the other estimated date, was also just random noise. However, we have interesting results for the rest of the data!

We have a different result for positive event 3. It looks like there’s an initial spike at the interval 1 day before/after the event with a positive difference of 0.21 between post sentiments, then posts quickly fall back to baseline. We can say from this that either the positive effect of the event was extremely short-lived, or the days before and after the event were just too similar (maybe anticipation?).

Lastly, we’ll look at the effect of negative event 2. This one was particularly interesting to me.

When I compared the means, all of the mean post sentiments after negative event 2 were actually more positive than before. I wonder what this says about my psychological response to a sad incident; maybe a defense mechanism? Anyway, there is an extreme difference between the before and after posts immediately following the event, then the difference between sentiments before and after becomes negligible, followed by a more subdued but still significant period of positive sentiments. The first period lasts 2 to 19 days after the event, and the second period lasts about 40 to 72 days after the event.

Looking at the paired means, though, reveals a flaw in this method. My ‘after’ post sentiments remain relatively constant from 2 to 72 days after the event, but there is a period of positive posting 20 to 39 days before the event that makes the difference much smaller. This suggests the dampened effect during this time period isn’t due to the effect of the event lessening, but to our chosen baseline possibly being affected by a second event.

Although these analyses may not be perfect (can they ever be?), they do show that positive and negative events have the ability to change the sentiments of my Facebook posts. I hope you’ve found this interesting, and maybe inspiring enough to analyze your own Facebook posts.