Text mining of 2016 Presidential Election candidates' tweets

In this project, we analyze the Twitter data of Donald Trump and Hillary Clinton. The data contains detailed information about their tweets from January 2016 to Nov 29 2016 (for Clinton, the data spans April 14 2016 to Nov 29 2016). We hope to use this data to find tweeting patterns for Trump and Clinton, such as which words they use most often and at what times they write most of their tweets. To answer these questions, we take a text mining approach. Hopefully, by the end of this project, we will find some hints that foreshadowed Trump's eventual win in the election.

1. Dealing with the data

First, we need to import and clean our data. To simplify the analysis, we focus only on tweets written in English.

library(tidyverse)
library(tidytext)
library(stringr)
library(lubridate)
library(DT)
library(RColorBrewer)

tweets <- read_csv("/Users/xuxian/Documents/UCLA related/R/Projects/2016 Presidential election tweet/tweets.csv")

tweets_cleaned <- tweets%>%filter(lang=="en")%>%select(id, handle, text, is_retweet, time,retweet_count,favorite_count)
tweets_cleaned$time<-ymd_hms(tweets_cleaned$time)

After the cleaning, we can take a look at our data here.

tweets_cleaned
## # A tibble: 6,248 x 7
##         id handle text  is_retweet time                retweet_count
##      <dbl> <chr>  <chr> <lgl>      <dttm>                      <dbl>
##  1 7.81e17 Hilla… "The… FALSE      2016-09-28 00:22:34           218
##  2 7.81e17 Hilla… "Las… TRUE       2016-09-27 23:45:00          2445
##  3 7.81e17 Hilla… "Cou… TRUE       2016-09-27 23:26:40          7834
##  4 7.81e17 Hilla… "If … FALSE      2016-09-27 23:08:41           916
##  5 7.81e17 Hilla… "Bot… FALSE      2016-09-27 22:30:27           859
##  6 7.81e17 realD… "Joi… FALSE      2016-09-27 22:13:24          2181
##  7 7.81e17 Hilla… "Thi… FALSE      2016-09-27 21:35:28          1303
##  8 7.81e17 Hilla… "Whe… FALSE      2016-09-27 21:25:31          1833
##  9 7.81e17 realD… "Onc… FALSE      2016-09-27 21:08:22          4132
## 10 7.81e17 Hilla… "3) … TRUE       2016-09-27 21:00:13          1087
## # … with 6,238 more rows, and 1 more variable: favorite_count <dbl>

2. Exploratory data analysis

2.1 Preliminary analysis

Now that we have the cleaned dataset, we are ready to do some preliminary data analysis. First, we count the number of tweets for both Trump and Clinton. The summary below shows that the two totals are nearly identical (3,117 for Clinton and 3,131 for Trump).

tweets_cleaned%>%count(handle)
## # A tibble: 2 x 2
##   handle              n
##   <chr>           <int>
## 1 HillaryClinton   3117
## 2 realDonaldTrump  3131

Next, we look at how often each of them retweets.

tweets_cleaned%>%count(handle,is_retweet)%>%group_by(handle)%>%mutate(total = sum(n))%>%transmute(handle,is_retweet,prop = n/total,total)
## # A tibble: 4 x 4
## # Groups:   handle [2]
##   handle          is_retweet   prop total
##   <chr>           <lgl>       <dbl> <int>
## 1 HillaryClinton  FALSE      0.818   3117
## 2 HillaryClinton  TRUE       0.182   3117
## 3 realDonaldTrump FALSE      0.961   3131
## 4 realDonaldTrump TRUE       0.0393  3131

We can visualize the information above with a bar plot.

library(plotly)
## Warning: package 'plotly' was built under R version 4.0.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
retweet<-tweets_cleaned%>%count(handle,is_retweet)

ggplotly(retweet%>%ggplot(aes(x = handle,y = n,fill = is_retweet))+geom_bar(stat = "identity",position = "fill")+labs(title = "Share of retweets among Donald Trump's and Hillary Clinton's tweets", x = "User Name", y = "Proportion")+scale_fill_brewer(palette = "Paired")+
  theme_minimal()+
  theme(legend.position = "top"))

From this graph, we can see that Clinton retweets more than Trump: about 18.2% of Clinton’s tweets are retweets, compared with only 3.93% of Trump’s.

Then we analyze the retweet counts and favorite counts of their tweets.

fav_retweet <- tweets_cleaned%>%group_by(handle)%>%summarize(ave_retweet = sum(retweet_count)/n(), ave_favorite = sum(favorite_count)/n())
## `summarise()` ungrouping output (override with `.groups` argument)
fav_retweet
## # A tibble: 2 x 3
##   handle          ave_retweet ave_favorite
##   <chr>                 <dbl>        <dbl>
## 1 HillaryClinton        3028.        6814.
## 2 realDonaldTrump       5803.       16673.

We can see that, on average, Trump’s tweets are not only retweeted more but also favorited more than Clinton’s. Maybe this could be seen as a hint of the final result of the 2016 presidential election.

ggplotly(tweets_cleaned %>% ggplot(aes(x = handle, y = retweet_count))+geom_boxplot(fill = "lightblue")+scale_y_log10()+labs(title = "Distribution of retweet counts for Trump's and Clinton's tweets",x = "User Name",  y = "Retweet count (log scale)")+
  theme_minimal()+
  theme(legend.position = "top"))
ggplotly(tweets_cleaned %>% ggplot(aes(x = handle, y = favorite_count))+geom_boxplot(fill = "pink")+scale_y_log10()+labs(title = "Distribution of favorite counts for Trump's and Clinton's tweets",x = "User Name",  y = "Favorite count (log scale)")+
  theme_minimal()+
  theme(legend.position = "top"))
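
The box plots above use a log scale because engagement counts are heavily skewed: a handful of viral tweets can pull the average far above what a typical tweet receives. As a minimal sketch of why the median is worth reporting next to the mean, the example below uses a made-up data frame (`toy` and its values are hypothetical, not taken from the tweet data).

```r
library(dplyr)

# Hypothetical retweet counts: account "A" has one viral tweet
toy <- tibble(
  handle = c(rep("A", 4), rep("B", 4)),
  retweet_count = c(100, 120, 130, 50000, 900, 1000, 1100, 1200)
)

# Mean and median per account; the mean is sensitive to the outlier,
# the median is not
toy_summary <- toy %>%
  group_by(handle) %>%
  summarise(
    mean_retweet = mean(retweet_count),
    median_retweet = median(retweet_count),
    .groups = "drop"
  )

toy_summary
```

In this toy data, "A" has the higher mean but the lower median. Running the same summary on tweets_cleaned would show whether the gap between the two candidates holds for a typical tweet as well as for the average.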

2.2 Time span analysis

In this section, we focus on how Trump’s and Clinton’s tweeting patterns change over time.

The graph below shows how the daily number of Clinton’s and Trump’s tweets changes.

tweets_cleaned$date <- ymd(as.Date(tweets_cleaned$time))

ggplotly(tweets_cleaned%>%count(handle, date)%>%ggplot(aes(x = date, y = n, color = handle))+geom_line()+labs(title = "Total number of Trump's and Clinton's tweets across time",x = "Date",  y = "Total Count", color = "User Name")+scale_color_brewer(palette = "Paired")+
  theme_minimal()+
  theme(legend.position = "top"))

Notice that Trump’s number of tweets per day doesn’t vary much over time, but Clinton clearly tweets more as the presidential election approaches.

We can also find out how the favorite and retweet counts change over time for both of them.

ggplotly(tweets_cleaned%>%group_by(handle, date)%>%summarise(ave_retweet = sum(retweet_count)/n(), ave_favorite = sum(favorite_count)/n())%>%ggplot(aes(x = date, y = ave_retweet, color = handle))+geom_line()+labs(title = "Average retweet count of Trump's and Clinton's tweets across time",x = "Date",  y = "Average retweet count", color = "User Name")+scale_color_brewer(palette = "Paired")+
  theme_minimal()+
  theme(legend.position = "top"))
## `summarise()` regrouping output by 'handle' (override with `.groups` argument)
ggplotly(tweets_cleaned%>%group_by(handle, date)%>%summarise(ave_retweet = sum(retweet_count)/n(), ave_favorite = sum(favorite_count)/n())%>%ggplot(aes(x = date, y = ave_favorite, color = handle))+geom_line()+labs(title = "Average favorite count of Trump's and Clinton's tweets across time",x = "Date",  y = "Average favorite count", color = "User Name")+scale_color_brewer(palette = "Paired")+
  theme_minimal()+
  theme(legend.position = "top"))
## `summarise()` regrouping output by 'handle' (override with `.groups` argument)

From the two graphs above, we can see that, except on June 9 2016, both the average retweet count and the average favorite count of Trump’s tweets are greater than those of Clinton’s. To find out what happened on that day, we can filter for the tweets of that particular day.

tweets_cleaned%>%filter(date == ymd("2016-06-09"), is_retweet==FALSE)%>%select(handle,text,favorite_count,retweet_count)%>%arrange(desc(favorite_count))%>%datatable()

After studying their tweeting patterns by date, we can focus on smaller time units like weekday and hour.

The diagram below shows the number of tweets Clinton and Trump wrote in each one-hour interval.

ggplotly(tweets_cleaned%>%mutate(hour = as.factor(hour(time)))%>%count(handle,hour)%>%ggplot(aes(x = hour, y = n, fill = handle))+geom_col(position = "dodge")+scale_fill_brewer(palette = "Paired")+labs(title = "Total number of Trump's and Clinton's tweets in a day",x = "Hour",  y = "Total Count", fill = "User Name")+
  theme_minimal()+
  theme(legend.position = "top"))

In this graph, we see that Trump usually sent far more tweets than Clinton from 3 a.m. to 1 p.m. In addition, I was surprised to see that Clinton wrote more tweets than Trump around midnight.
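
One thing worth checking before reading too much into the hours: ymd_hms() parses timestamps as UTC unless told otherwise, so the hours plotted above are on whatever clock the raw time column uses. If we assume the raw times are recorded in UTC (this is an assumption, the data does not say), the hours can be shifted to US Eastern time with lubridate's with_tz(). The timestamps below are hypothetical examples.

```r
library(lubridate)

# Hypothetical timestamps, parsed as UTC (the ymd_hms() default)
utc_times <- ymd_hms(c("2016-09-28 00:22:34", "2016-09-27 12:05:00"))

# The same instants shown on the US East Coast clock
# (assumption: the candidates' local time is Eastern)
eastern_times <- with_tz(utc_times, tzone = "America/New_York")

hour(utc_times)      # hour of day in UTC
hour(eastern_times)  # hour of day in Eastern time
```

Under that assumption, a tweet logged just after midnight would really have been sent in the evening, which would change how the hourly chart should be read.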

Then we plot their tweeting pattern over the week.

ggplotly(tweets_cleaned%>%mutate(weekday = as.factor(wday(time,label = TRUE)))%>%count(handle,weekday)%>%ggplot(aes(x = weekday, y = n, fill = handle))+geom_col(position = "dodge")+labs(title = "Total number of Trump's and Clinton's tweets in a week",x = "Weekday",  y = "Total Count", fill = "User Name")+scale_fill_brewer(palette = "Paired")+
  theme_minimal()+
  theme(legend.position = "top"))

From this graph, we see clearly that Hillary sends more tweets during the weekdays, while Trump sends out more tweets during the weekends.

We can combine the two graphs above into a heatmap.

ggplotly(tweets_cleaned%>%mutate(weekday = as.factor(wday(time,label = TRUE)),hour = as.factor(hour(time)))%>%count(handle,weekday,hour)%>%ggplot(aes(x = weekday, y = hour, fill = n))+geom_tile()+scale_fill_gradient(low = "yellow", high = "red")+labs(title = "Distribution of Trump's and Clinton's tweets across time",x = "Weekday",  y = "Hour", fill = "Total count")+facet_wrap(~handle)+
  theme_minimal()+
  theme(legend.position = "top"))
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

3. Text mining

In this section, we focus on word usage in Trump’s and Clinton’s tweets. To do so, we need to tokenize the text of each tweet into single words.

tweet_token<-tweets_cleaned%>%unnest_tokens(word,text)%>%anti_join(stop_words)%>%filter(!word%in%c("https","t.co"))
## Joining, by = "word"
tweet_token
## # A tibble: 52,603 x 8
##         id handle is_retweet time                retweet_count favorite_count
##      <dbl> <chr>  <lgl>      <dttm>                      <dbl>          <dbl>
##  1 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  2 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  3 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  4 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  5 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  6 7.81e17 Hilla… FALSE      2016-09-28 00:22:34           218            651
##  7 7.81e17 Hilla… TRUE       2016-09-27 23:45:00          2445           5308
##  8 7.81e17 Hilla… TRUE       2016-09-27 23:45:00          2445           5308
##  9 7.81e17 Hilla… TRUE       2016-09-27 23:45:00          2445           5308
## 10 7.81e17 Hilla… TRUE       2016-09-27 23:45:00          2445           5308
## # … with 52,593 more rows, and 2 more variables: date <date>, word <chr>
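
To make the tokenization step concrete, here is a minimal sketch of the same pipeline on two made-up tweets (`toy_tweets` and its text are hypothetical). It shows that unnest_tokens() produces one lowercase word per row, and that the stop-word join and the filter then drop common words and the link fragments.

```r
library(dplyr)
library(tidytext)

# Two hypothetical tweets
toy_tweets <- tibble(
  id = 1:2,
  text = c("The economy is GREAT for America! https://t.co/abc123",
           "We will fight for families.")
)

toy_tokens <- toy_tweets %>%
  unnest_tokens(word, text) %>%             # one lowercase word per row
  anti_join(stop_words, by = "word") %>%    # drop common English stop words
  filter(!word %in% c("https", "t.co"))     # drop link fragments

toy_tokens
```

Note that other pieces of a shortened link, such as the part after the slash, may still survive as tokens. They are usually unique, so they rarely matter for frequency counts.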

3.1 Word analysis

First, let’s take a look at the most frequently used words for both presidential candidates.

tweet_token%>%count(handle,word)%>%group_by(handle)%>%top_n(10)%>%arrange(desc(n))
## Selecting by n
## # A tibble: 20 x 3
## # Groups:   handle [2]
##    handle          word                      n
##    <chr>           <chr>                 <int>
##  1 HillaryClinton  trump                   710
##  2 HillaryClinton  hillary                 668
##  3 HillaryClinton  donald                  422
##  4 realDonaldTrump trump                   366
##  5 realDonaldTrump hillary                 342
##  6 realDonaldTrump realdonaldtrump         322
##  7 realDonaldTrump trump2016               318
##  8 HillaryClinton  president               279
##  9 realDonaldTrump amp                     268
## 10 realDonaldTrump people                  225
## 11 realDonaldTrump makeamericagreatagain   223
## 12 realDonaldTrump america                 210
## 13 realDonaldTrump cruz                    198
## 14 HillaryClinton  america                 196
## 15 realDonaldTrump clinton                 192
## 16 HillaryClinton  people                  191
## 17 HillaryClinton  trump's                 149
## 18 HillaryClinton  families                143
## 19 HillaryClinton  potus                   141
## 20 HillaryClinton  americans               129

We can visualize the information above in two ways.

ggplotly(tweet_token%>%count(handle,word)%>%group_by(handle)%>%top_n(20)%>%arrange(desc(n))%>%ggplot(aes(x = n, y = reorder(word,n), fill = handle))+geom_col()+facet_wrap(~handle,scales = "free_y")+scale_fill_brewer(palette = "Paired")+labs(title = "Most used words by Trump and Clinton",x = "Count",  y = "Words", fill = "Username")+
  theme_minimal()+
  theme(legend.position = "top"))
## Selecting by n
library(reshape2)
library(wordcloud)
tweet_token%>%count(handle,word, sort = TRUE)%>%acast(word ~ handle, value.var = "n", fill = 0) %>%comparison.cloud(colors = c("lightblue", "pink"),max.words = 150)

3.2 tf-idf

Next, we analyze the words used uniquely by each of them by using tf-idf.

tweet_token%>%count(handle,word)%>%bind_tf_idf(word, handle, n)%>%arrange(desc(tf_idf))%>%group_by(handle)%>%top_n(20)%>%ggplot(aes(x = tf_idf,  y = reorder(word, tf_idf), fill = handle))+geom_col()+facet_wrap(~handle, scales = "free")+scale_fill_brewer(palette = "Paired")+labs(title = "Words with high tf-idf used by Trump and Clinton",x = "tf-idf",  y = "Word", fill = "Username")+
  theme_minimal()+
  theme(legend.position = "top")
## Selecting by tf_idf

By studying the tf-idf, we can gain some insight into which issues the two presidential candidates focus on most.
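
It helps to see what tf-idf actually computes when there are only two "documents" (one per candidate). The term frequency is a word's count divided by all word counts for that handle, and the inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. With two handles, idf is therefore ln(2) for a word only one candidate uses and 0 for a word both use. The sketch below checks this by hand against bind_tf_idf() on a made-up count table (`toy_counts` is hypothetical).

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two handles
toy_counts <- tibble(
  handle = c("A", "A", "B", "B"),
  word   = c("america", "wall", "america", "families"),
  n      = c(3, 1, 2, 2)
)

# tf-idf as computed by tidytext
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, handle, n)

# The same quantity computed by hand
n_docs <- n_distinct(toy_counts$handle)
manual <- toy_counts %>%
  group_by(handle) %>%
  mutate(tf = n / sum(n)) %>%                        # share of the handle's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(handle))) %>% # ln(docs / docs with word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)

all.equal(toy_tf_idf$tf_idf, manual$tf_idf)
```

In this toy table, "america" gets a tf-idf of 0 for both handles because both use it. This is why the tf-idf chart only surfaces words that are exclusive to one candidate, ranked by how often that candidate uses them.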

4. Sentiment analysis

First, we can find the positive and negative words most used by Trump and Clinton.

sent <- tweet_token%>%inner_join(get_sentiments("afinn"))
## Joining, by = "word"
sent%>%group_by(handle,word)%>%summarise(contribution = n()*mean(value))%>%group_by(handle)%>%arrange(desc(abs(contribution)))%>%top_n(20) %>% ggplot(aes(x = contribution, y = reorder(word,contribution), fill = handle))+geom_col()+facet_wrap(~handle, scales = "free_y")+scale_fill_brewer(palette = "Paired")+labs(title = "Top 20 positive words used by Trump and Clinton",x = "Contribution",  y = "Word", fill = "Username")+
  theme_minimal()+
  theme(legend.position = "top")
## `summarise()` regrouping output by 'handle' (override with `.groups` argument)
## Selecting by contribution

sent%>%group_by(handle,word)%>%summarise(contribution = n()*mean(value))%>%group_by(handle)%>%top_n(-20) %>% ggplot(aes(x = contribution, y = reorder(word,contribution), fill = handle))+geom_col()+facet_wrap(~handle, scales = "free")+scale_fill_brewer(palette = "Paired")+labs(title = "Top 20 negative words used by Trump and Clinton",x = "Contribution",  y = "Word", fill = "Username")+
  theme_minimal()+
  theme(legend.position = "top")
## `summarise()` regrouping output by 'handle' (override with `.groups` argument)
## Selecting by contribution
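
The contribution used in the two charts above is the number of times a word appears multiplied by its sentiment score. As a minimal sketch of that calculation, the example below uses a tiny hand-made lexicon in place of AFINN (`toy_lexicon` and its scores are illustrative values, not the real AFINN scores) and a few made-up tokens.

```r
library(dplyr)

# Hypothetical mini-lexicon standing in for AFINN (scores are made up)
toy_lexicon <- tibble(
  word  = c("great", "win", "bad", "crooked"),
  value = c(3, 4, -3, -2)
)

# Hypothetical tokens, one row per word occurrence
toy_tokens <- tibble(
  handle = c("A", "A", "A", "B", "B", "B"),
  word   = c("great", "great", "bad", "win", "crooked", "jobs")
)

toy_contrib <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%   # words without a score are dropped
  group_by(handle, word) %>%
  summarise(contribution = n() * mean(value), .groups = "drop")

# Most positive word per handle; slice_max() is the current dplyr
# replacement for top_n()
top_positive <- toy_contrib %>%
  group_by(handle) %>%
  slice_max(contribution, n = 1) %>%
  ungroup()

top_positive
```

Words that are not in the lexicon (here "jobs") drop out at the inner join, so the charts only reflect the part of each candidate's vocabulary that the lexicon covers.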

After getting familiar with the positive and negative word usage of Trump and Clinton, we can further visualize how the sentiment level of their tweets changes over time.

total_sent <- sent%>%group_by(id)%>%summarise(total_sent = sum(value,na.rm = TRUE))%>%arrange(desc(total_sent))
## `summarise()` ungrouping output (override with `.groups` argument)
tweet_sent <- tweet_token%>%left_join(total_sent, by = "id")

tweet_sent_date <- tweet_sent%>%group_by(handle, date)%>%summarise(av_sent_date = mean(total_sent,na.rm = TRUE))
## `summarise()` regrouping output by 'handle' (override with `.groups` argument)
ggplotly(tweet_sent_date%>%ggplot(aes(x = date , y =av_sent_date, color = handle))+geom_line()+geom_smooth(se = FALSE)+scale_color_manual(values = c("lightblue","pink"))+labs(x = "Date", y = "Average Sentiment Level", title = "Change in tweets' sentiment level")+
  theme_minimal()+
  theme(legend.position = "top"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
Xuxin Zhang
