Data analysis on "Friends"

1. Introduction

In this project, we focus on what is probably the most world-renowned TV series: “Friends”. The project is divided into two parts. The first is a general analysis of the data we have, which can be further broken down into three subcategories: season analysis, episode analysis, and character analysis. In the second part, we create a lasso regression model with the tidymodels package to predict the IMDb rating of episodes. That part is quite similar to what we did in the “Office” post, so check that out as well if you are interested.

2. Preliminary data analysis

First of all, we need to load the packages we are going to use later and import the friends data.

library(tidyverse)
library(tidymodels)
library(plotly)
library(stringr)
library(tidylo)
library(tidytext)

theme_set(theme_classic())
friends <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv')
friends_emotions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends_emotions.csv')
friends_info <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends_info.csv')

We can take a look at each of them to see what kinds of data are included.

We see that the friends data contains the lines spoken by each character in each episode; friends_info holds the directors and writers, the US viewer count (in millions), and the IMDb rating for each episode; in friends_emotions, we can find the category of emotion attached to each utterance.

friends%>%head()
## # A tibble: 6 x 6
##   text                                 speaker    season episode scene utterance
##   <chr>                                <chr>       <dbl>   <dbl> <dbl>     <dbl>
## 1 There's nothing to tell! He's just … Monica Ge…      1       1     1         1
## 2 C'mon, you're going out with the gu… Joey Trib…      1       1     1         2
## 3 All right Joey, be nice. So does he… Chandler …      1       1     1         3
## 4 Wait, does he eat chalk?             Phoebe Bu…      1       1     1         4
## 5 (They all stare, bemused.)           Scene Dir…      1       1     1         5
## 6 Just, 'cause, I don't want her to g… Phoebe Bu…      1       1     1         6
friends_info%>%head()
## # A tibble: 6 x 8
##   season episode title directed_by written_by air_date   us_views_millio…
##    <dbl>   <dbl> <chr> <chr>       <chr>      <date>                <dbl>
## 1      1       1 The … James Burr… David Cra… 1994-09-22             21.5
## 2      1       2 The … James Burr… David Cra… 1994-09-29             20.2
## 3      1       3 The … James Burr… Jeffrey A… 1994-10-06             19.5
## 4      1       4 The … James Burr… Alexa Jun… 1994-10-13             19.7
## 5      1       5 The … Pamela Fry… Jeff Gree… 1994-10-20             18.6
## 6      1       6 The … Arlene San… Adam Chas… 1994-10-27             18.2
## # … with 1 more variable: imdb_rating <dbl>
friends_emotions%>%head()
## # A tibble: 6 x 5
##   season episode scene utterance emotion
##    <dbl>   <dbl> <dbl>     <dbl> <chr>  
## 1      1       1     4         1 Mad    
## 2      1       1     4         3 Neutral
## 3      1       1     4         4 Joyful 
## 4      1       1     4         5 Neutral
## 5      1       1     4         6 Neutral
## 6      1       1     4         7 Neutral

2.1 Season analysis

As said in the introduction, we will first do a season analysis. Specifically, we will analyze the distribution of the IMDb rating for each of the 10 seasons.

With the help of the smoothing line, we find the following pattern: the IMDb rating reaches its peak around Season 5, then declines and bottoms out at Season 8, before finally climbing back to the top in the last season. In addition, we can see that the IMDb rating varies the most in Season 4: it has both the lowest rating of 7.2 and the second-highest rating of 9.5.

ggplotly(friends_info%>%mutate(season = as.factor(season))%>%
  ggplot(aes(y  = imdb_rating, x = season,fill = season))+
  geom_boxplot(alpha  = 0.3,show.legend = FALSE)+
  geom_smooth(aes(group = 1), se = FALSE)+theme(legend.position = "null")+
  labs(x = "Season", y = "IMDb rating",
       title ="IMDb rating distribution for each season"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
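
As a quick numeric check on the Season 4 claim, we could also summarise the spread of ratings within each season (a minimal sketch; this summary is not part of the original output):

# spread of IMDb ratings per season
friends_info%>%
  group_by(season)%>%
  summarise(sd_rating = sd(imdb_rating, na.rm = TRUE),
            min_rating = min(imdb_rating, na.rm = TRUE),
            max_rating = max(imdb_rating, na.rm = TRUE))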

2.2 Episode analysis

Next we focus on the episode analysis. In order for the titles of all episodes to fit in the graph, we first need to drop the opening phrases “The One with” and “The One Where”. After doing so, we can visualize the information. The following graph shows the IMDb rating for each episode. It’s not hard to see that “the Prom Video”, “Everybody Finds Out”, “the Embryos”, and “The Last One” have the highest ratings, while “the Invitation”, “Mac and C.H.E.E.S.E.”, “the Vows”, and “Christmas” have the lowest.

remove<-"The One with|The One Where"
title_clean<-stringr::str_remove_all(friends_info$title,remove)

friends_info%>%mutate(epi_num = row_number())%>%
  ggplot(aes(x = epi_num, y = imdb_rating ))+
  geom_point(aes(color = as.factor(season),size = us_views_millions),alpha = 0.6)+
  geom_path(aes(color = as.factor(season)),alpha = 0.6)+
  expand_limits(x = -10)+
  expand_limits(x = 230)+
  geom_text(label = title_clean,hjust = 1,check_overlap = TRUE,size = 2.5)+
  theme(legend.position = "null")+geom_smooth()+
  labs(x = "Total episode number", 
       y = "IMDb rating", 
       title = "The change of IMDb rating across episodes")+
  scale_color_brewer(palette = "Paired")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To take a closer look, we can pull out the 15 best-rated episodes as well as the 15 worst.

ggplotly(friends_info%>%distinct(title,imdb_rating,season)%>%
  top_n(15,imdb_rating)%>%
  ggplot(aes(x = imdb_rating,
             y = reorder(title, imdb_rating),
             color = as.factor(season)))+
  geom_point(size = 3)+
  geom_segment(aes(x = 9.0,
                   xend = imdb_rating, 
                   y = title, 
                   yend = title),size = 2, alpha = 0.3)+
    scale_x_continuous(breaks = c(9,9.1,9.2,9.3,9.4,9.5,9.6))+
    labs(x = "IMDb rating",
         y = "Title",
         title = "The top 15 episodes with best ratings",
         color = "seasons"))
ggplotly(friends_info%>%distinct(title,imdb_rating,season)%>%
  top_n(-15,imdb_rating)%>%
  ggplot(aes(x = imdb_rating,
             y = reorder(title, imdb_rating),
             color = as.factor(season)))+
  geom_point(size = 3)+
    geom_segment(aes(x = 7, 
                     xend = imdb_rating,
                     y = title, 
                     yend = title),size = 2,alpha = 0.3)+
    scale_x_continuous(breaks = c(7.2,7.4,7.6,7.8,8))+
    labs(x = "IMDb rating",
         y = "Title",
         title = "The top 15 episodes with worst ratings",
         color = "season"))

Since we also have the viewership numbers at hand, we can ask: does the rating depend on the number of viewers? To answer this, we can regress the rating on the viewer count, and here is the result: there is a strong positive relationship between the two variables.

friends_info%>%
  ggplot(aes(x = us_views_millions, y = imdb_rating))+
  geom_point()+geom_smooth()+
  labs(x = "US views (in millions)",
       y = "IMDb rating",
       title = "The relationship between viewing counts and IMDb rating")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
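
To put a number on this claim, we could compute the correlation coefficient directly (a sketch; the exact value is not reported in the original post):

# correlation between US viewership and IMDb rating, ignoring missing values
friends_info%>%
  summarise(correlation = cor(us_views_millions, imdb_rating, use = "complete.obs"))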

Then, we can find out how the sentiment level changes across episodes. To do this, we first tokenize all the lines of each episode and then sum up the AFINN sentiment values of the words to get the sentiment level of an episode. Here is what we get: on average, the sentiment level gradually increases across the seasons. From the graph, we notice that “The One Where Rachel Has a Baby” has the lowest sentiment level. I don’t remember the content exactly, but the negative sentiment could be due to the fact that Rachel and Ross have a fight in that episode and Rachel complains a lot about the pain of being in labor.

text<-friends_info%>%mutate(epi_num = row_number())%>%
  select(season,episode,title,epi_num)%>%
  right_join(friends, by = c("season" = "season", "episode" = "episode"))%>%distinct(season,epi_num,title,text)

text%>%
  unnest_tokens(word,text)%>%
  anti_join(stop_words)%>%inner_join(get_sentiments("afinn"))%>%
  group_by(season,epi_num,title)%>%
  summarise(sent = sum(value))%>%ungroup()%>%ggplot(aes(x = epi_num, y = sent))+
  geom_point(aes(color = as.factor(season)),alpha = 0.6)+
  geom_line(aes(color = as.factor(season)),alpha = 0.6)+
  geom_text(aes(label = title_clean),check_overlap = TRUE,size = 2.5)+
  geom_smooth()+
  scale_color_brewer(palette = "Paired")+theme(legend.position = "null")+
  expand_limits(x = -10)+
  labs(x = "Total episode number", 
       y = "Sentiment level",
       title = "The change of sentiment level across episodes")
## Joining, by = "word"
## Joining, by = "word"
## `summarise()` regrouping output by 'season', 'epi_num' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.3 Character analysis

Now we start our analysis of the characters. Intuitively, we can find the characters with the most lines spoken throughout the 10 seasons. Based on the graph, it’s obvious that the six main characters have far more lines than everyone else. Among the six, Rachel and Ross have the most lines, each with over 9,000; Phoebe has the fewest, at around 7,500.

ggplotly(friends%>%count(speaker)%>%
  filter(speaker!="NA")%>%
  filter(n>120)%>%arrange(desc(n))%>%
  ggplot(aes(y = reorder(speaker,n), x = n,color = speaker))+
  geom_point(size = 4)+
  geom_segment(aes(x = 40,
                   xend = n, 
                   y = reorder(speaker,n),
                   yend = reorder(speaker,n)),
               size = 3, alpha = 0.2)+
  scale_x_log10()+theme(legend.position = "null")+
  labs(x = "Number of times one speaks", 
       y = "Characters",
       title = "The characters with the most lines"))

Next, we use the bind_tf_idf() function from the tidytext package to calculate the tf-idf of each word spoken by the main characters. By ranking the words in decreasing order of tf-idf, we can identify the words most distinctive to each character.

friends%>%
  select(speaker, text)%>%unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%count(speaker,word)%>%add_count(speaker)%>%
  filter(nn>7000)%>%filter(speaker!="Scene Directions")%>%
  bind_tf_idf(word,speaker,n)%>%
  group_by(speaker)%>%
  top_n(10,tf_idf)%>%
  ungroup()%>%
  ggplot(aes(x = tf_idf, 
             y = reorder_within(word, tf_idf, speaker), 
             fill = speaker))+
  geom_col()+facet_wrap(~speaker, scales = "free_y")+scale_y_reordered()+
  scale_fill_brewer(palette = "Paired")+
  theme(legend.position = "null")+
  labs(x = "tf-idf", 
       y = "Words",
       title = "Top 10 unique words used by the main characters calculated by tf-idf")+
  theme_minimal()+
  theme(legend.position = "null")
## Joining, by = "word"
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.

In fact, there is another way to find the characteristic words for each character: calculating the weighted log odds ratio of each word with the bind_log_odds() function from the tidylo package.

friends%>%
  select(speaker, text)%>%unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%count(speaker,word)%>%add_count(speaker)%>%
  filter(nn>7000)%>%filter(speaker!="Scene Directions")%>%
  add_count(word)%>%
  filter(nnn > 50)%>%
  bind_log_odds(speaker,word,n)%>%
  group_by(speaker)%>%
  top_n(10,log_odds_weighted)%>%
  ggplot(aes(x = log_odds_weighted, 
             y = reorder_within(word, log_odds_weighted, speaker), 
             fill = speaker))+
  geom_col()+facet_wrap(~speaker, scales = "free_y")+scale_y_reordered()+
  scale_fill_brewer(palette = "Paired")+
  labs(x = "Weighted log odds ratio", 
       y = "Words",
       title = "Top 10 unique words used by the main characters calculated by odds ratio")+
  theme(legend.position = "null")
## Joining, by = "word"
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
## Storing counts in `nnn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.

We can now compare the results of these two methods. For the log odds ratio method, the distinctive words tend to be mostly names of other characters and interjections like “whoa”, “yeah”, “honey”, and “sweetie”. We couldn’t really gain much useful information from this method. However, if we look at the result of the tf-idf method, we notice that the words are not only used distinctively by each character but also reveal personal information about them. For example, “evolution” and “paleontology” have the highest tf-idf for Ross, which makes sense since Ross is a paleontologist; for Joey, “script” and “casting” have a high tf-idf for a similar reason. I therefore believe the results of the tf-idf method are much more informative than those of the log odds ratio method. If you showed the tf-idf visualization to people who have never watched “Friends”, they would probably get a rough idea of who these characters are and what their personalities are like.

Next, we can do a sentiment analysis on the characters. The friends_emotions dataset only tells us the emotion of each utterance in an episode, not which character delivered it. In order to do a sentiment analysis for all the characters, we first need to recover the speaker of each utterance, which we can do by joining friends_emotions with the friends data. One important thing to notice: since different characters have different numbers of lines, we cannot simply rank the raw count of utterances per emotion; instead, we need to divide by each character’s total number of lines.

After doing this, we get the top six characters for each type of emotion based on their adjusted counts of utterances associated with that emotion. If you have watched “Friends”, you will find the ranking to be pretty accurate: Janice does have a comparatively high proportion of sad utterances, and the lines for Pete Becker are quite powerful since he is both a millionaire and an aspiring UFC fighter.

friends%>%select(-text)%>%right_join(friends_emotions)%>%
    count(speaker, emotion)%>%add_count(speaker,name = "num")%>%
    filter(speaker!="NA")%>%
    filter(speaker!="#ALL#")%>%
    filter(num>50)%>%
    mutate(average_emo = n/num)%>%
    group_by(emotion)%>%top_n(6,average_emo)%>%
    ggplot(aes(y = reorder_within(speaker,average_emo,emotion), 
               x = average_emo, 
               fill = emotion))+
    geom_col()+
    facet_wrap(~emotion,scales = "free")+scale_y_reordered()+
    theme(legend.position = "null")+
    labs(x = "Average emotion level",
         y = "Characters",
         title = "Character rankings in different types of emotion")+
  scale_fill_brewer(palette = "Paired")
## Joining, by = c("season", "episode", "scene", "utterance")
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`

An interesting question to ask is whether the IMDb rating of an episode depends on the number of lines spoken by the main characters. To find out, we can do a fairly rough analysis by plotting the IMDb rating of each episode against the proportion of lines a given character has within that episode, together with a regression line (we will elaborate on this analysis in the model construction section).

Based on the visualization, the rating tends to go down as the proportion of lines spoken by Chandler or Joey increases. For Ross and Phoebe, the more they speak, the higher the rating of that episode tends to be.

friends_speak<-friends%>%count(speaker,season,episode)%>%
  add_count(speaker,name = "total_speak")%>%
  add_count(season,episode,name = "total_speak_per_ep")%>%
  filter(total_speak>7000)%>%select(-total_speak)%>%mutate(prop_line = n/total_speak_per_ep)%>%
  select(-n,-total_speak_per_ep)
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
friends_info%>%select(season,episode,imdb_rating)%>%
  inner_join(friends_speak,by = c("season","episode"))%>%
  ggplot(aes(x = prop_line, y = imdb_rating, color = speaker))+
  geom_point(alpha = 0.5)+facet_wrap(~speaker)+
  scale_x_log10()+
  geom_smooth()+
  scale_color_brewer(palette = "Paired")+
  labs(x = "Proportion of lines within a specific episodes",
       y = "IMDb rating",
       title = "The relationship between the proportion of lines spoken and rating")+
  theme(legend.position = "null")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
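
To make the visual impression a bit more concrete, one could fit a simple linear model per character and inspect the sign of each slope (a rough sketch on the same data; this model is not part of the original analysis):

# slope of IMDb rating against log10 proportion of lines, per main character
friends_info%>%select(season,episode,imdb_rating)%>%
  inner_join(friends_speak,by = c("season","episode"))%>%
  group_by(speaker)%>%
  summarise(slope = coef(lm(imdb_rating ~ log10(prop_line)))[2])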

3. Constructing a lasso regression model

In this section, we start to build our regression model. First, the data needs to be cleaned further by removing the characters with few lines; rows whose speaker is “NA” or “Scene Directions” are also filtered out. After that, we pivot the dataset into a wider form, shown below, so that we know how many times each main character speaks in every episode.

friends_actors<-friends%>%
  select(season, episode,speaker)%>%
  count(season, episode,speaker)%>%add_count(speaker)%>%
  filter(nn>=150)%>%
  filter(speaker!="NA")%>%
  filter(speaker!="Scene Directions")%>%
  select(-nn)%>%
  group_by(season, episode)%>%
  pivot_wider(names_from = speaker, values_from = n, values_fill = 0)
## Using `n` as weighting variable
## ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
friends_actors
## # A tibble: 236 x 18
## # Groups:   season, episode [236]
##    season episode `#ALL#` `Chandler Bing` `Joey Tribbiani` `Monica Geller`
##     <dbl>   <dbl>   <int>           <int>            <int>           <int>
##  1      1       1       8              39               39              73
##  2      1       2       2              16                8              28
##  3      1       3      21              36               28              52
##  4      1       4       4              32               24              47
##  5      1       5       0              35               36              33
##  6      1       6      14              58               31              34
##  7      1       7       3              25               27              23
##  8      1       8       3              37               11              34
##  9      1       9       1              28               32              48
## 10      1      10       5              31               19              23
## # … with 226 more rows, and 12 more variables: `Phoebe Buffay` <int>, `Rachel
## #   Green` <int>, `Ross Geller` <int>, `Carol Willick` <int>, `Judy
## #   Geller` <int>, `Janice Litman Goralnik` <int>, `Frank Buffay Jr.` <int>,
## #   `Richard Burke` <int>, `Jack Geller` <int>, `Emily Waltham` <int>, `Mike
## #   Hannigan` <int>, `Charlie Wheeler` <int>

We do the same thing to the “friends_info” data so that it shows the writer and director composition of each episode.

Notice the difference in row count between the two datasets: it is due to the fact that we filtered out directors and writers who appear only a few times.

friends_creators<-friends_info%>%select(season,
                                        episode, 
                                        directed_by,
                                        written_by,
                                        imdb_rating,
                                        us_views_millions)%>%
  mutate(written_by = str_replace(written_by,"Teleplay by :"," &"))%>%
  mutate(written_by = str_remove(written_by,"Story by :"))%>%
  separate_rows(written_by,sep ="&")%>%
  pivot_longer(directed_by:written_by,names_to = "position",values_to = "name")%>%
  distinct(season,episode,imdb_rating,position,name,us_views_millions)%>%
  add_count(name)%>%filter(n>=12)%>%
  select(-position,-n)%>%
  mutate(count = 1)%>%group_by(season,episode, imdb_rating)%>%
  pivot_wider(names_from = name, values_from = count, values_fill = 0)

friends_creators
## # A tibble: 182 x 13
## # Groups:   season, episode, imdb_rating [182]
##    season episode imdb_rating us_views_millio… `James Burrows` ` Marta Kauffma…
##     <dbl>   <dbl>       <dbl>            <dbl>           <dbl>            <dbl>
##  1      1       1         8.3             21.5               1                1
##  2      1       2         8.1             20.2               1                1
##  3      1       3         8.2             19.5               1                0
##  4      1       4         8.1             19.7               1                0
##  5      1       7         9               23.5               1                0
##  6      1       8         8.1             21.1               1                0
##  7      1       9         8.2             23.1               1                0
##  8      1      10         8.1             19.9               0                0
##  9      1      11         8.2             26.6               1                0
## 10      1      14         8.3             23.8               1                0
## # … with 172 more rows, and 7 more variables: `Peter Bonerz` <dbl>, `Michael
## #   Lembeck` <dbl>, `Gail Mancuso` <dbl>, `Kevin S. Bright` <dbl>, `Gary
## #   Halvorson` <dbl>, ` Ted Cohen` <dbl>, ` Ellen Plummer` <dbl>
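
As a quick confirmation of the row difference mentioned above (a sketch):

nrow(friends_actors)   # 236: every episode, with per-character line counts
nrow(friends_creators) # 182: episodes remaining after dropping rare directors/writers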

Then we join the two datasets together to get friends_staff.

friends_staff<-friends_actors%>%inner_join(friends_creators,by = c("season","episode"))%>%
  inner_join(friends_info%>%select(season,episode, title), by= c("season","episode"))

friends_staff
## # A tibble: 182 x 30
## # Groups:   season, episode [182]
##    season episode `#ALL#` `Chandler Bing` `Joey Tribbiani` `Monica Geller`
##     <dbl>   <dbl>   <int>           <int>            <int>           <int>
##  1      1       1       8              39               39              73
##  2      1       2       2              16                8              28
##  3      1       3      21              36               28              52
##  4      1       4       4              32               24              47
##  5      1       7       3              25               27              23
##  6      1       8       3              37               11              34
##  7      1       9       1              28               32              48
##  8      1      10       5              31               19              23
##  9      1      11       3              42               32              41
## 10      1      14       0              40               21              21
## # … with 172 more rows, and 24 more variables: `Phoebe Buffay` <int>, `Rachel
## #   Green` <int>, `Ross Geller` <int>, `Carol Willick` <int>, `Judy
## #   Geller` <int>, `Janice Litman Goralnik` <int>, `Frank Buffay Jr.` <int>,
## #   `Richard Burke` <int>, `Jack Geller` <int>, `Emily Waltham` <int>, `Mike
## #   Hannigan` <int>, `Charlie Wheeler` <int>, imdb_rating <dbl>,
## #   us_views_millions <dbl>, `James Burrows` <dbl>, ` Marta Kauffman` <dbl>,
## #   `Peter Bonerz` <dbl>, `Michael Lembeck` <dbl>, `Gail Mancuso` <dbl>, `Kevin
## #   S. Bright` <dbl>, `Gary Halvorson` <dbl>, ` Ted Cohen` <dbl>, ` Ellen
## #   Plummer` <dbl>, title <chr>

After getting the cleaned data, we split it into training and testing sets. In order to tune the penalty parameter of our lasso model later, we also create a set of bootstrap resamples with bootstraps().

set.seed(123)
friends_split<-initial_split(friends_staff,strata = season)
friends_train <- training(friends_split)
friends_test <- testing(friends_split)

friends_boot <- bootstraps(friends_train,strata = season)

Next we use the recipe() function to pre-process our data. In the following code, we normalize all the numeric predictors and remove predictors that have zero variance or are highly correlated with others.

friends_recipe<-recipe(imdb_rating~., data =  friends_train)%>%
  update_role(title, new_role = "ID")%>%
  step_normalize(all_predictors())%>%
  step_zv(all_predictors())%>%
  step_corr(all_predictors())

prep(friends_recipe)%>%juice()
## # A tibble: 138 x 30
##    season episode `#ALL#` `Chandler Bing` `Joey Tribbiani` `Monica Geller`
##     <dbl>   <dbl>   <dbl>           <dbl>            <dbl>           <dbl>
##  1  -1.60  -1.57    3.80            0.326            0.298           3.09 
##  2  -1.60  -1.43    0.438          -1.62            -2.00           -0.628
##  3  -1.60  -1.16    1.56           -0.266           -0.812           0.940
##  4  -1.60  -0.619   0.998           0.157           -1.77           -0.133
##  5  -1.60  -0.483  -0.122          -0.604           -0.220           1.02 
##  6  -1.60  -0.347   2.12           -0.350           -1.18           -1.04 
##  7  -1.60  -0.211   0.998           0.580           -0.220           0.445
##  8  -1.60   0.196  -0.681           0.411           -1.03           -1.21 
##  9  -1.60   0.467  -0.122           1.34            -0.664          -1.54 
## 10  -1.60   0.603   0.438          -1.28            -0.664           1.68 
## # … with 128 more rows, and 24 more variables: `Phoebe Buffay` <dbl>, `Rachel
## #   Green` <dbl>, `Ross Geller` <dbl>, `Carol Willick` <dbl>, `Judy
## #   Geller` <dbl>, `Janice Litman Goralnik` <dbl>, `Frank Buffay Jr.` <dbl>,
## #   `Richard Burke` <dbl>, `Jack Geller` <dbl>, `Emily Waltham` <dbl>, `Mike
## #   Hannigan` <dbl>, `Charlie Wheeler` <dbl>, us_views_millions <dbl>, `James
## #   Burrows` <dbl>, ` Marta Kauffman` <dbl>, `Peter Bonerz` <dbl>, `Michael
## #   Lembeck` <dbl>, `Gail Mancuso` <dbl>, `Kevin S. Bright` <dbl>, `Gary
## #   Halvorson` <dbl>, ` Ted Cohen` <dbl>, ` Ellen Plummer` <dbl>, title <fct>,
## #   imdb_rating <dbl>

Then we use the workflow() function to conduct our first fit. The fitting result suggests that the viewer count and the number of lines Monica and Frank speak are associated with a higher overall rating. However, this is just the result from our untuned model. Our next step is to tune the penalty parameter using the bootstrap resamples.

library(vip)
## Warning: package 'vip' was built under R version 4.0.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
set.seed(234)
friends_lasso<-linear_reg(penalty = 0.001, mixture=1)%>%set_engine("glmnet")

workflow()%>%add_recipe(friends_recipe)%>%add_model(friends_lasso)%>%
  fit(data = friends_train)%>%pull_workflow_fit()%>%tidy()%>%arrange(desc(estimate))
## Warning: package 'glmnet' was built under R version 4.0.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0-2
## # A tibble: 29 x 3
##    term                   estimate penalty
##    <chr>                     <dbl>   <dbl>
##  1 (Intercept)              8.50     0.001
##  2 us_views_millions        0.202    0.001
##  3 Monica Geller            0.109    0.001
##  4 Frank Buffay Jr.         0.0804   0.001
##  5 season                   0.0711   0.001
##  6 Ross Geller              0.0551   0.001
##  7 Chandler Bing            0.0505   0.001
##  8 Richard Burke            0.0319   0.001
##  9 #ALL#                    0.0291   0.001
## 10 Janice Litman Goralnik   0.0205   0.001
## # … with 19 more rows

The code below shows how to tune the parameter: we fit the model across a regular grid of penalty values on the bootstrap resamples to get tuned_lasso.

set.seed(2020)
tune_model <-linear_reg(penalty = tune(), mixture=1)%>%
  set_engine("glmnet")

lambda <- grid_regular(penalty(),levels = 50)

tuned_lasso<-tune_grid(workflow()%>%add_recipe(friends_recipe)%>%add_model(tune_model),
          resamples = friends_boot,
          grid = lambda)

To visualize the result, we plot the mean metric value against each penalty value. The best value for the parameter is the one at the peak of rsq and the bottom of rmse.

tuned_lasso%>%collect_metrics()%>%
  ggplot(aes(x = penalty, y = mean, color = .metric))+
  geom_point()+geom_path()+facet_wrap(~.metric,scales = "free",ncol = 1)+
  scale_x_log10()+
  scale_color_brewer(palette = "Paired")
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 row(s) containing missing values (geom_path).
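
Before finalizing, we could also inspect the best-performing candidates numerically with tune’s show_best() helper (a sketch; this step is not in the original post):

# top penalty values ranked by RMSE across the bootstrap resamples
tuned_lasso%>%show_best("rmse")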

We take the best value for penalty and finalize our workflow with it.

best_penalty<-tuned_lasso%>%select_best("rmse")

final_workflow<-finalize_workflow(workflow()%>%
                                    add_recipe(friends_recipe)%>%
                                    add_model(tune_model), 
                                  best_penalty)

Finally, we use last_fit() to fit the finalized workflow on the training data and evaluate its performance on the testing data.

set.seed(5432)
final_workflow%>%last_fit(friends_split)%>%collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.496 
## 2 rsq     standard      0.0231
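
We could also plot the test-set predictions against the actual ratings to inspect the fit visually (a sketch using collect_predictions(); this plot is not part of the original post):

set.seed(5432)
final_workflow%>%last_fit(friends_split)%>%
  collect_predictions()%>%
  ggplot(aes(x = imdb_rating, y = .pred))+
  geom_point(alpha = 0.5)+
  geom_abline(linetype = "dashed")+  # perfect-prediction reference line
  labs(x = "Actual IMDb rating",
       y = "Predicted IMDb rating",
       title = "Test set predictions vs. actual ratings")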

After the fitting, we can visualize the importance of each variable in the model in the following way.

set.seed(4321)
final_workflow%>%fit(data = friends_train)%>%
  pull_workflow_fit()%>%
  vi(lambda = best_penalty$penalty)%>%
  mutate(Importance = abs(Importance))%>%
  ggplot(aes(x = Importance, y = reorder(Variable, Importance), fill = Sign))+
  geom_col()+
  labs(x = "Importance",  y = "Variable", title = "The Variable of Importance graph")
