IMDb rating analysis of "The Office"

1. Introduction

In this project, we will conduct a data analysis of the TV series “The Office”, which is one of my favorite TV series of all time. Trust me, you will either hate it or really love it.

Two datasets will be used for our analysis: one contains the IMDb rating of each episode across the 9 seasons; the other contains the writers and directors of each episode, along with every line spoken by each character in the series.

This project is broken down into three parts. The first part focuses on the episodes; the second focuses on the different characters; in the last part, we run a lasso regression on our data to find out which factors influence the rating of an episode.

2. Preliminary data analysis

2.1 Season and episode analysis

First, we import our data and take a look at the two datasets we have.

library(tidyverse)
library(tidytext)
library(schrute)
library(ggridges)
library(plotly)
library(DT)
library(tidymodels)

theme_set(theme_classic())

office_script <- schrute::theoffice

ratings_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv")

office_script %>% head(10)
## # A tibble: 10 x 12
##    index season episode episode_name director writer character text 
##    <int>  <int>   <int> <chr>        <chr>    <chr>  <chr>     <chr>
##  1     1      1       1 Pilot        Ken Kwa… Ricky… Michael   All …
##  2     2      1       1 Pilot        Ken Kwa… Ricky… Jim       Oh, …
##  3     3      1       1 Pilot        Ken Kwa… Ricky… Michael   So y…
##  4     4      1       1 Pilot        Ken Kwa… Ricky… Jim       Actu…
##  5     5      1       1 Pilot        Ken Kwa… Ricky… Michael   All …
##  6     6      1       1 Pilot        Ken Kwa… Ricky… Michael   Yes,…
##  7     7      1       1 Pilot        Ken Kwa… Ricky… Michael   I've…
##  8     8      1       1 Pilot        Ken Kwa… Ricky… Pam       Well…
##  9     9      1       1 Pilot        Ken Kwa… Ricky… Michael   If y…
## 10    10      1       1 Pilot        Ken Kwa… Ricky… Pam       What?
## # … with 4 more variables: text_w_direction <chr>, imdb_rating <dbl>,
## #   total_votes <int>, air_date <fct>
ratings_raw %>% head(10)
## # A tibble: 10 x 6
##    season episode title             imdb_rating total_votes air_date  
##     <dbl>   <dbl> <chr>                   <dbl>       <dbl> <date>    
##  1      1       1 Pilot                     7.6        3706 2005-03-24
##  2      1       2 Diversity Day             8.3        3566 2005-03-29
##  3      1       3 Health Care               7.9        2983 2005-04-05
##  4      1       4 The Alliance              8.1        2886 2005-04-12
##  5      1       5 Basketball                8.4        3179 2005-04-19
##  6      1       6 Hot Girl                  7.8        2852 2005-04-26
##  7      2       1 The Dundies               8.7        3213 2005-09-20
##  8      2       2 Sexual Harassment         8.2        2736 2005-09-27
##  9      2       3 Office Olympics           8.4        2742 2005-10-04
## 10      2       4 The Fire                  8.4        2713 2005-10-11

As mentioned in the introduction, the first thing we can do to study the seasons and episodes of “The Office” is to visualize how the rating of each episode changes over time. The graph below shows a clear pattern: the ratings peak around seasons four and five, then start to decline and bottom out in season 8. Eventually, as the finale approaches, the ratings begin to climb again. This pattern matches my personal viewing experience: the show seems to lose some of its spirit after Michael leaves Scranton.

ratings_raw %>%
  mutate(total_ep = row_number()) %>%  # running episode index across all seasons
  ggplot(aes(x = total_ep, y = imdb_rating)) +
  geom_point(aes(color = as.factor(season))) +
  geom_path(aes(color = as.factor(season))) +
  geom_text(aes(label = title), hjust = 1, check_overlap = TRUE) +
  geom_smooth() +
  theme(legend.position = "none") +
  expand_limits(x = -20) +
  labs(x = "Episode number", y = "IMDb rating",
       title = "How the rating of each episode changes over time")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can take a step further and use a ridge plot to show how the rating distribution changes from season to season. From this graph, we see that the typical rating first goes up and then comes back down.

ratings_raw %>%
  ggplot(aes(y = as.factor(season),
             x = imdb_rating,
             fill = as.factor(season),
             group = as.factor(season))) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE) +
  expand_limits(y = 10.7) +
  labs(x = "IMDb rating", y = "Season number", title = "IMDb rating distribution for each season")
## Picking joint bandwidth of 0.178

Since we have all the lines spoken by the characters, we can conduct a sentiment analysis for each episode.
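
The AFINN lexicon we use below assigns each word an integer score from -5 (very negative) to +5 (very positive), and an episode’s sentiment level is simply the sum of these scores over all of its words. As a quick peek at the lexicon (a small illustrative query; it assumes the lexicon has already been downloaded via the textdata package):

get_sentiments("afinn") %>%
  filter(word %in% c("love", "hate", "awesome", "awful"))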

episode_sent <- office_script %>%
  select(season, episode, episode_name, character, text) %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words) %>%                # drop common stop words
  inner_join(get_sentiments("afinn")) %>%  # attach AFINN scores
  group_by(season, episode, episode_name) %>%
  summarise(sentiment = sum(value)) %>%
  ungroup() %>%
  transmute(total_ep = row_number(), episode_name, sentiment, season)
## Joining, by = "word"
## Joining, by = "word"
## `summarise()` regrouping output by 'season', 'episode' (override with `.groups` argument)
episode_sent %>%
  ggplot(aes(x = total_ep, y = sentiment)) +
  geom_point(aes(color = as.factor(season))) +
  geom_path(aes(color = as.factor(season))) +
  geom_text(aes(label = episode_name), hjust = 1, check_overlap = TRUE) +
  geom_smooth() +
  theme(legend.position = "none") +
  expand_limits(x = -20) +
  labs(x = "Episode number", y = "Sentiment level",
       title = "How the sentiment level of each episode changes over time")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Based on the graph above, we see that the sentiment level gradually decreases, then rises again at the beginning of season 5. If we look closer, we find that the season-level picture is heavily influenced by a few episodes with extreme sentiment levels, which is reflected in the ridge plot below.

episode_sent %>%
  ggplot(aes(y = as.factor(season),
             x = sentiment,
             fill = as.factor(season),
             group = as.factor(season))) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE) +
  expand_limits(y = 10.7) +
  labs(x = "Sentiment level", y = "Season number", title = "Sentiment level distribution for each season")
## Picking joint bandwidth of 20.3

A natural question to ask, then, is whether there is a relationship between the rating and the sentiment level.

To answer it, we plot the rating of each episode against the corresponding sentiment level. Adding a regression line, we notice that the relationship between the two roughly follows a “W” shape. This suggests that the rating is mostly sensitive to extreme sentiment levels: only when the sentiment level is very high or very low does the rating move much.

Note that there is a small peak in rating around a sentiment level of 50. As the sentiment level increases gradually, the rating increases with it; once a threshold is met, however, the rating begins to decrease.

ratings_raw %>%
  mutate(total_ep = row_number()) %>%
  left_join(episode_sent) %>%
  ggplot(aes(x = sentiment, y = imdb_rating)) +
  geom_point(aes(color = as.factor(season))) +
  geom_smooth() +
  labs(x = "Sentiment level", y = "IMDb rating",
       title = "The relationship between the sentiment level and the IMDb rating of an episode",
       color = "Season")
## Joining, by = c("season", "total_ep")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.2 Character analysis

Now we can begin our analysis of the characters of the TV series.

The following graph shows the top 10 characters with the most lines in each of the 9 seasons. Based on it, we can see that Michael, Jim, Dwight, and Pam are the four characters with the most lines overall. First place is later taken over by Andy and finally by Dwight.

(Spoiler alert: the sharp drop in Michael’s lines is because he leaves Scranton after season 7. You will understand why Andy and then Dwight become the characters with the most lines once you watch the whole series.)

From this single graph, we can already tell who the main characters of the whole series are.

office_script %>%
  count(season = as.factor(season), character) %>%
  group_by(season) %>%
  top_n(10) %>%
  # reorder_within()/scale_y_reordered() sort the bars within each facet
  ggplot(aes(x = n, y = reorder_within(character, n, season), fill = character)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~season, scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = "Number of lines", y = "Character name",
       title = "Top 10 characters with the most lines in each season")
## Selecting by n

I have created an interactive line plot to show the changing number of lines for each character more clearly.

top_character <- office_script %>%
  count(character, sort = TRUE) %>%
  top_n(12) %>%
  pull(character)
## Selecting by n
ggplotly(
  office_script %>%
    count(season, character) %>%
    filter(character %in% top_character) %>%
    ggplot(aes(x = season, y = n, color = character)) +
    geom_point() +
    geom_path() +
    scale_y_log10() +
    scale_x_continuous(breaks = 1:9) +
    labs(x = "Season", y = "Number of lines",
         title = "Changing number of lines for each character")
)

Besides knowing how many lines each character has in different seasons, we can actually find out which words are uniquely spoken by each character with the help of a text mining technique: tf-idf. Based on my familiarity with the TV series, these words are indeed characteristic of each of them.
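
To build some intuition for tf-idf: it multiplies how often a character says a word (term frequency) by the log inverse of how many characters use that word (inverse document frequency), so words that everyone says score near zero. Here is a minimal toy sketch with made-up counts (not data from the show):

# "office" appears in both characters' lines, so its idf (and tf-idf) is 0;
# "beets" is unique to character A and gets a positive tf-idf.
tibble(character = c("A", "A", "B"),
       word = c("beets", "office", "office"),
       n = c(2, 1, 3)) %>%
  bind_tf_idf(word, character, n)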

office_script %>%
  select(season, episode, episode_name, character, text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(character, word) %>%
  filter(character %in% top_character) %>%
  bind_tf_idf(word, character, n) %>%
  group_by(character) %>%
  top_n(10) %>%
  ggplot(aes(x = tf_idf, y = reorder_within(word, tf_idf, character), fill = character)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~character, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "Words", title = "Unique words spoken by each character")
## Joining, by = "word"
## Selecting by tf_idf

In addition, we can study the sentiment level of each character based on their lines. One should note that since different characters have very different numbers of lines, we have to account for this when calculating the characters’ sentiment levels, so we divide each character’s total sentiment by their line count.

character_season_line <- office_script %>%
  count(season, character) %>%
  filter(character %in% top_character)

ggplotly(
  office_script %>%
    select(season, episode, episode_name, character, text) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    inner_join(get_sentiments("afinn")) %>%
    group_by(season, character) %>%
    summarise(sentiment = sum(value)) %>%
    filter(character %in% top_character) %>%
    left_join(character_season_line) %>%
    mutate(av_sent = sentiment / n) %>%  # normalize by number of lines
    ggplot(aes(x = season, y = av_sent, color = character)) +
    geom_point() +
    geom_path() +
    labs(x = "Season", y = "Average sentiment level",
         title = "Average sentiment level of each character")
)
## Joining, by = "word"
## Joining, by = "word"
## `summarise()` regrouping output by 'season' (override with `.groups` argument)
## Joining, by = c("season", "character")

From the graph above, we can see that Angela and Dwight are the two characters with the most negative sentiment levels in the first 5 seasons, which rings true since these two are pretty mean to the others.

3. Lasso regression
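
As a quick refresher, lasso regression minimizes the usual residual sum of squares plus an L1 penalty on the coefficients, RSS + lambda * sum(|beta_j|). Because the L1 penalty shrinks some coefficients exactly to zero, lasso doubles as a variable selection tool, which is exactly what we want here; the penalty lambda is the parameter we will tune later.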

Before we proceed to apply lasso regression to our data, we need to further clean the data we have.

remove_regex <- "[:punct:]|[:digit:]|parts |part |the |and"

office_ratings <- ratings_raw %>%
  transmute(
    episode_name = str_to_lower(title),
    episode_name = str_remove_all(episode_name, remove_regex),
    episode_name = str_trim(episode_name),
    imdb_rating
  )
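
As a quick sanity check (an illustrative one-liner, not part of the pipeline), here is what this cleaning does to a sample title:

# "The Dundies" -> lowercase -> drop "the " -> trim, leaving "dundies"
str_trim(str_remove_all(str_to_lower("The Dundies"), remove_regex))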

office_info <- schrute::theoffice %>%
  mutate(
    season = as.numeric(season),
    episode = as.numeric(episode),
    episode_name = str_to_lower(episode_name),
    episode_name = str_remove_all(episode_name, remove_regex),
    episode_name = str_trim(episode_name)
  ) %>%
  select(season, episode, episode_name, director, writer, character)
actors <- office_info %>%
  count(episode_name, character) %>%             # lines per character per episode
  add_count(character, wt = n, name = "nn") %>%  # total lines per character
  filter(nn > 800) %>%                           # keep only the major characters
  select(-nn) %>%
  pivot_wider(names_from = character, values_from = n, values_fill = 0)
actors
## # A tibble: 185 x 16
##    episode_name  Andy Angela Darryl Dwight   Jim Kelly Kevin Michael Oscar   Pam
##    <chr>        <int>  <int>  <int>  <int> <int> <int> <int>   <int> <int> <int>
##  1 a benihana …    28     37      3     61    44     5    14     108     1    57
##  2 aarm            44     39     30     87    89     0    30       0    28    34
##  3 after hours     20     11     14     60    55     8     4       0    10    15
##  4 alliance         0      7      0     47    49     0     3      68    14    22
##  5 angry y         53      7      5     16    19    13     9       0     7    29
##  6 baby shower     13     13      9     35    27     2     4      79     3    25
##  7 back from v…     3      4      6     22    25     0     5      70     0    33
##  8 banker           1      2      0     17     0     0     2      44     0     5
##  9 basketball       0      3     15     25    21     0     1     104     2    14
## 10 beach games     18      8      0     38    22     9     5     105     5    23
## # … with 175 more rows, and 5 more variables: Phyllis <int>, Ryan <int>,
## #   Toby <int>, Erin <int>, Jan <int>
creators <- office_info %>%
  distinct(episode_name, director, writer) %>%
  separate_rows(writer, sep = ";") %>%  # some episodes have several writers
  pivot_longer(director:writer, names_to = "job", values_to = "name") %>%
  add_count(name) %>%
  filter(n > 10) %>%                    # keep people involved in more than 10 episodes
  mutate(count = 1) %>%
  distinct(episode_name, name, count) %>%
  pivot_wider(names_from = name, values_from = count, values_fill = 0)

creators
## # A tibble: 135 x 14
##    episode_name `Ken Kwapis` `Greg Daniels` `B.J. Novak` `Paul Lieberste…
##    <chr>               <dbl>          <dbl>        <dbl>            <dbl>
##  1 pilot                   1              1            0                0
##  2 diversity d…            1              0            1                0
##  3 health care             0              0            0                1
##  4 basketball              0              1            0                0
##  5 hot girl                0              0            0                0
##  6 dundies                 0              1            0                0
##  7 sexual hara…            1              0            1                0
##  8 office olym…            0              0            0                0
##  9 fire                    1              0            1                0
## 10 halloween               0              1            0                0
## # … with 125 more rows, and 9 more variables: `Mindy Kaling` <dbl>, `Paul
## #   Feig` <dbl>, `Gene Stupnitsky` <dbl>, `Lee Eisenberg` <dbl>, `Jennifer
## #   Celotta` <dbl>, `Randall Einhorn` <dbl>, `Brent Forrester` <dbl>, `Jeffrey
## #   Blitz` <dbl>, `Justin Spitzer` <dbl>
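
The key step in both chunks is pivot_wider() with values_fill = 0, which spreads a long per-episode table into a wide matrix with one column per person. A minimal sketch with made-up rows:

tibble(episode_name = c("pilot", "pilot", "fire"),
       name = c("Ken Kwapis", "Greg Daniels", "Ken Kwapis"),
       count = 1) %>%
  pivot_wider(names_from = name, values_from = count, values_fill = 0)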

With the help of the two chunks of code above, we now have two new datasets at hand: actors and creators. The actors data shows the number of lines each character has in every single episode, and creators shows the writers and directors of each episode. Next, we join all the datasets we have into a single one.

rating_clean <- ratings_raw %>%
  transmute(season, episode, episode_name = title, rating = imdb_rating) %>%
  mutate(
    episode_name = str_to_lower(episode_name),
    episode_name = str_remove_all(episode_name, remove_regex),
    episode_name = str_trim(episode_name)
  )

staff <- rating_clean %>%
  inner_join(creators) %>%
  inner_join(actors)
## Joining, by = "episode_name"
## Joining, by = "episode_name"
staff
## # A tibble: 134 x 32
##    season episode episode_name rating `Ken Kwapis` `Greg Daniels` `B.J. Novak`
##     <dbl>   <dbl> <chr>         <dbl>        <dbl>          <dbl>        <dbl>
##  1      1       1 pilot           7.6            1              1            0
##  2      1       2 diversity d…    8.3            1              0            1
##  3      1       3 health care     7.9            0              0            0
##  4      1       5 basketball      8.4            0              1            0
##  5      1       6 hot girl        7.8            0              0            0
##  6      2       1 dundies         8.7            0              1            0
##  7      2       2 sexual hara…    8.2            1              0            1
##  8      2       3 office olym…    8.4            0              0            0
##  9      2       4 fire            8.4            1              0            1
## 10      2       5 halloween       8.2            0              1            0
## # … with 124 more rows, and 25 more variables: `Paul Lieberstein` <dbl>, `Mindy
## #   Kaling` <dbl>, `Paul Feig` <dbl>, `Gene Stupnitsky` <dbl>, `Lee
## #   Eisenberg` <dbl>, `Jennifer Celotta` <dbl>, `Randall Einhorn` <dbl>, `Brent
## #   Forrester` <dbl>, `Jeffrey Blitz` <dbl>, `Justin Spitzer` <dbl>,
## #   Andy <int>, Angela <int>, Darryl <int>, Dwight <int>, Jim <int>,
## #   Kelly <int>, Kevin <int>, Michael <int>, Oscar <int>, Pam <int>,
## #   Phyllis <int>, Ryan <int>, Toby <int>, Erin <int>, Jan <int>

Now we are ready to build a machine learning model on the staff dataset using the tidymodels framework.

First we split our data into training and testing sets. In order to tune the lasso model later, we also create a set of bootstrap resamples, staff_boot.

set.seed(1234)
staff_split<-initial_split(staff,strata = season)
staff_train <- training(staff_split)
staff_test <- testing(staff_split)

set.seed(123)
staff_boot<-bootstraps(staff_train, strata = season)

With the help of the recipe() function, we do some feature engineering to preprocess the data: step_zv() removes zero-variance predictors, step_corr() filters out highly correlated ones, and step_normalize() centers and scales all predictors, while episode_name is kept only as an identifier.

staff_recipe <- recipe(rating ~ ., data = staff_train) %>%
  update_role(episode_name, new_role = "ID") %>%  # keep as identifier, not a predictor
  step_zv(all_predictors()) %>%                   # drop zero-variance columns
  step_corr(all_predictors()) %>%                 # drop highly correlated columns
  step_normalize(all_predictors())                # center and scale

prep(staff_recipe) %>% juice()
## # A tibble: 102 x 31
##    season episode episode_name `Ken Kwapis` `Greg Daniels` `B.J. Novak`
##     <dbl>   <dbl> <fct>               <dbl>          <dbl>        <dbl>
##  1  -1.73  -1.54  pilot               3.02           2.49        -0.397
##  2  -1.73  -1.40  diversity d…        3.02          -0.397        2.49 
##  3  -1.73  -1.27  health care        -0.328         -0.397       -0.397
##  4  -1.28  -1.54  dundies            -0.328          2.49        -0.397
##  5  -1.28  -1.40  sexual hara…        3.02          -0.397        2.49 
##  6  -1.28  -1.27  office olym…       -0.328         -0.397       -0.397
##  7  -1.28  -0.991 halloween          -0.328          2.49        -0.397
##  8  -1.28  -0.854 fight               3.02          -0.397       -0.397
##  9  -1.28  -0.717 client             -0.328         -0.397       -0.397
## 10  -1.28  -0.579 performance…       -0.328         -0.397       -0.397
## # … with 92 more rows, and 25 more variables: `Paul Lieberstein` <dbl>, `Mindy
## #   Kaling` <dbl>, `Paul Feig` <dbl>, `Lee Eisenberg` <dbl>, `Jennifer
## #   Celotta` <dbl>, `Randall Einhorn` <dbl>, `Brent Forrester` <dbl>, `Jeffrey
## #   Blitz` <dbl>, `Justin Spitzer` <dbl>, Andy <dbl>, Angela <dbl>,
## #   Darryl <dbl>, Dwight <dbl>, Jim <dbl>, Kelly <dbl>, Kevin <dbl>,
## #   Michael <dbl>, Oscar <dbl>, Pam <dbl>, Phyllis <dbl>, Ryan <dbl>,
## #   Toby <dbl>, Erin <dbl>, Jan <dbl>, rating <dbl>

Then we set up our first rough machine learning model. Based on the result, we see that with the penalty equal to 0.1, Jim, Michael, Greg Daniels, and Jan are the most important factors in determining the rating of a single episode.

library(vip)
## Warning: package 'vip' was built under R version 4.0.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
set.seed(2345)

lasso_model <- linear_reg(penalty = 0.1, mixture = 1) %>%  # mixture = 1 means pure lasso
  set_engine("glmnet")

staff_wf <- workflow() %>%
  add_recipe(staff_recipe) %>%
  add_model(lasso_model)

staff_wf %>%
  fit(staff_train) %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(desc(estimate))
## Warning: package 'glmnet' was built under R version 4.0.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0-2
## # A tibble: 30 x 3
##    term             estimate penalty
##    <chr>               <dbl>   <dbl>
##  1 (Intercept)       8.37        0.1
##  2 Jim               0.106       0.1
##  3 Michael           0.0666      0.1
##  4 Greg Daniels      0.0180      0.1
##  5 Jan               0.00399     0.1
##  6 season            0           0.1
##  7 episode           0           0.1
##  8 Ken Kwapis        0           0.1
##  9 B.J. Novak        0           0.1
## 10 Paul Lieberstein  0           0.1
## # … with 20 more rows

Now we are going to tune the penalty parameter.

tune_model <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)  # 50 candidate penalties on a log scale

# named lasso_grid so it doesn't shadow the tune_grid() function
lasso_grid <- tune_grid(
  workflow() %>% add_recipe(staff_recipe) %>% add_model(tune_model),
  resamples = staff_boot,
  grid = lambda_grid
)

Now we have two evaluation metrics, RMSE and R-squared, for each candidate penalty.

lasso_grid %>% collect_metrics()
## # A tibble: 100 x 7
##     penalty .metric .estimator   mean     n std_err .config
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>  
##  1 1.00e-10 rmse    standard   0.625     25  0.0119 Model01
##  2 1.00e-10 rsq     standard   0.0679    25  0.0114 Model01
##  3 1.60e-10 rmse    standard   0.625     25  0.0119 Model02
##  4 1.60e-10 rsq     standard   0.0679    25  0.0114 Model02
##  5 2.56e-10 rmse    standard   0.625     25  0.0119 Model03
##  6 2.56e-10 rsq     standard   0.0679    25  0.0114 Model03
##  7 4.09e-10 rmse    standard   0.625     25  0.0119 Model04
##  8 4.09e-10 rsq     standard   0.0679    25  0.0114 Model04
##  9 6.55e-10 rmse    standard   0.625     25  0.0119 Model05
## 10 6.55e-10 rsq     standard   0.0679    25  0.0114 Model05
## # … with 90 more rows

We can visualize the two metrics in the following way:

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean, color = .metric)) +
  geom_point() +
  geom_line() +
  facet_wrap(~.metric, scales = "free", ncol = 1) +
  scale_x_log10()
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 3 row(s) containing missing values (geom_path).

With the following code, we select the penalty value that gives the lowest RMSE.

best_penalty <- lasso_grid %>% select_best("rmse")
best_penalty
## # A tibble: 1 x 2
##   penalty .config
##     <dbl> <chr>  
## 1  0.0954 Model45

With the best penalty at hand, we are ready to update our first rough model with this new parameter.

final_lasso <- finalize_workflow(
  workflow() %>% add_recipe(staff_recipe) %>% add_model(tune_model),
  best_penalty
)
final_lasso
## ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_zv()
## ● step_corr()
## ● step_normalize()
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.0954095476349996
## 
## Computational engine: glmnet

Finally, we can use the following code to visualize the importance of each factor. We can see that Greg Daniels, Jim, episode, and Michael are the top 4 factors with the greatest importance. That is to say, episodes written or directed by Greg Daniels, and episodes with more lines for Jim and Michael, tend to have higher ratings. On the other hand, Kelly, season, Erin, and Randall Einhorn have the strongest negative influence on the rating. I’m a little surprised by this result since Erin is my favorite character in the TV series.

final_lasso %>%
  fit(staff_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_penalty$penalty) %>%  # variable importance at the chosen penalty
  mutate(Importance = case_when(Sign == "POS" ~ Importance,
                                TRUE ~ -Importance)) %>%
  ggplot(aes(x = Importance, y = reorder(Variable, Importance), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance, y = Variable, yend = Variable), size = 1.5) +
  labs(x = "Importance", y = "Variable", title = "Variable importance plot")

To wrap up, we fit the finalized model on the training set and evaluate it on the testing data we created earlier; here is the result.

final_lasso %>% last_fit(staff_split) %>% collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.429
## 2 rsq     standard       0.121