Data analysis on campus recruitment

1. Introduction

Today, we will analyze data on campus recruitment. The data is simulated for MBA students at the XYZ campus and includes the students’ academic performance from middle school all the way through the MBA, as well as each student’s placement status and salary if placed. In this project, we aim to uncover the key factors that determine placement status. To achieve this goal, we first do some preliminary data analysis and then build a machine learning model that can hopefully predict whether a student will be placed or not. You can access the data at the following website.

https://www.kaggle.com/benroshan/factors-affecting-campus-placement

2. Preliminary data analysis

First, we need to import our data and load the packages we are going to use.

library(tidyverse)
library(tidymodels)
library(plotly)
campus <- read_csv("/Users/xuxian/Documents/UCLA related/R/Projects/tidymodel/campus recruit/campus recruit.csv")
theme_set(theme_light())

We can take a look at the data we have. Notice that the original column names are rather cryptic, so we rename them to clearer ones.

campus
## # A tibble: 215 x 15
##    sl_no gender ssc_p ssc_b hsc_p hsc_b hsc_s degree_p degree_t workex etest_p
##    <dbl> <chr>  <dbl> <chr> <dbl> <chr> <chr>    <dbl> <chr>    <chr>    <dbl>
##  1     1 M       67   Othe…  91   Othe… Comm…     58   Sci&Tech No        55  
##  2     2 M       79.3 Cent…  78.3 Othe… Scie…     77.5 Sci&Tech Yes       86.5
##  3     3 M       65   Cent…  68   Cent… Arts      64   Comm&Mg… No        75  
##  4     4 M       56   Cent…  52   Cent… Scie…     52   Sci&Tech No        66  
##  5     5 M       85.8 Cent…  73.6 Cent… Comm…     73.3 Comm&Mg… No        96.8
##  6     6 M       55   Othe…  49.8 Othe… Scie…     67.2 Sci&Tech Yes       55  
##  7     7 F       46   Othe…  49.2 Othe… Comm…     79   Comm&Mg… No        74.3
##  8     8 M       82   Cent…  64   Cent… Scie…     66   Sci&Tech Yes       67  
##  9     9 M       73   Cent…  79   Cent… Comm…     72   Comm&Mg… No        91.3
## 10    10 M       58   Cent…  70   Cent… Comm…     61   Comm&Mg… No        54  
## # … with 205 more rows, and 4 more variables: specialisation <chr>,
## #   mba_p <dbl>, status <chr>, salary <dbl>
campus <- campus %>%
  transmute(sl_no, gender,
            middle_school_score = ssc_p,
            middle_school_board = ssc_b,
            high_school_score = hsc_p,
            high_school_board = hsc_b,
            high_school_special = hsc_s,
            college_score = degree_p,
            college_major = degree_t,
            workex,
            employ_test = etest_p,
            mba_special = specialisation,
            mba_score = mba_p,
            status,
            salary)
campus
## # A tibble: 215 x 15
##    sl_no gender middle_school_s… middle_school_b… high_school_sco…
##    <dbl> <chr>             <dbl> <chr>                       <dbl>
##  1     1 M                  67   Others                       91  
##  2     2 M                  79.3 Central                      78.3
##  3     3 M                  65   Central                      68  
##  4     4 M                  56   Central                      52  
##  5     5 M                  85.8 Central                      73.6
##  6     6 M                  55   Others                       49.8
##  7     7 F                  46   Others                       49.2
##  8     8 M                  82   Central                      64  
##  9     9 M                  73   Central                      79  
## 10    10 M                  58   Central                      70  
## # … with 205 more rows, and 10 more variables: high_school_board <chr>,
## #   high_school_special <chr>, college_score <dbl>, college_major <chr>,
## #   workex <chr>, employ_test <dbl>, mba_special <chr>, mba_score <dbl>,
## #   status <chr>, salary <dbl>

Since our goal in this project is to create a model that predicts whether a person is placed or not, we can first take a look at how many people are placed in this data set.

campus%>%count(status)
## # A tibble: 2 x 2
##   status         n
##   <chr>      <int>
## 1 Not Placed    67
## 2 Placed       148
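
Note that roughly 69% of the applicants in this data set are placed, so a model that always predicts “Placed” would already reach about 0.69 accuracy. This majority-class baseline is worth keeping in mind when we evaluate the models later. A quick way to compute the proportions (a small addition, not part of the original output):

campus %>%
  count(status) %>%
  mutate(prop = n / sum(n))  # proportion of placed vs. not placed applicants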

To study which factors influence placement status, we start by analyzing how the applicants performed in middle school.

campus%>%ggplot(aes(x = middle_school_score, fill = status))+
    geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
    geom_density(alpha = 0.6)+scale_fill_brewer(palette = "Paired")+
    labs(x = "Scores at middle school", y = "Density", title = "Distribution of scores when the applicants were in middle school")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this visualization, we see that the applicants who received an offer (placed) generally have higher scores in middle school.

Next, we turn to the type of board of education for these applicants.

library(ggplot2)
campus%>%
    ggplot(aes(x = middle_school_score, fill = status))+
    geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
    geom_density(alpha = 0.6)+
    scale_fill_brewer(palette = "Paired")+facet_wrap(~middle_school_board,ncol = 1)+
    labs(x = "Scores at middle school", y = "Density", title = "Distribution of scores when the applicants were in middle school")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on this graph, we can see that the scores of applicants from the central board of education vary less than those of applicants from other boards. This pattern may reflect how educational resources are distributed among middle schools: applicants studying under the central board receive roughly equal educational resources, so the remaining differences in their scores mostly come from individual differences among the applicants. For schools under other boards, however, educational resources also play an important role in determining students’ scores, in addition to individual differences. For example, some schools may have a better team of teachers than others. Over time, these differences in resources would inevitably lead to a wider spread in student performance.
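
We can back up this observation with a quick numeric check of the spread of scores within each board (a small sketch, not part of the original analysis; run it to see the exact values):

campus %>%
  group_by(middle_school_board) %>%
  summarise(sd_score = sd(middle_school_score),  # spread of scores within each board
            n = n())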

Next, we proceed to high school. Again, we analyze the distribution of scores for placed and unplaced applicants. From the following graph, we see that the high school scores of unplaced applicants are generally lower than those of placed applicants, which suggests that high school score is another important factor.

campus%>%
  ggplot(aes(x = high_school_score, fill = status))+
  geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
  geom_density(alpha = 0.6)+scale_fill_brewer(palette = "Paired")+
  labs(x = "Scores at high school", y = "Density", title = "Distribution of scores when the applicants were in high school")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can extract more information by adding another variable to the plot above: high school specialization. In this graph, we see two unusual spikes in the density for placed applicants who specialized in Arts. My guess is that an Arts specialization is not a bonus point in the recruitment process, so the applicants who specialized in Arts during high school must have been truly excellent to be considered by the recruiters.

campus%>%
  ggplot(aes(x = high_school_score, fill = status))+
  geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
  geom_density(alpha = 0.6)+
  scale_fill_brewer(palette = "Paired")+facet_wrap(~high_school_special)+
  labs(x = "Scores at high school", y = "Density", title = "Distribution of scores for applicants with different high school specialization")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
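
One caveat worth noting: density estimates become spiky when a group contains only a handful of observations, so before reading too much into the Arts panel we can check how many Arts applicants there actually are (a quick count, not shown in the original analysis):

campus %>%
  count(high_school_special, status)  # group sizes behind each density panel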

After finishing the analysis of the applicants’ performance in high school, we turn to their performance in college. As we can see, there is still a divergence between the scores of placed and unplaced applicants, which means college score is also an important factor to consider during the recruiting process.

campus%>%
  ggplot(aes(x = college_score, fill = status))+
  geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
  geom_density(alpha = 0.6)+scale_fill_brewer(palette = "Paired")+
  labs(x = "Scores at college", y = "Density", title = "Distribution of scores when the applicants were in college")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here, we see that for students in other majors, there is no significant difference between the scores of placed and unplaced students. On the contrary, the divergence occurs in both the Comm&Mgmt and Sci&Tech majors.

campus%>%
  ggplot(aes(x = college_score, fill = status))+
  geom_histogram(aes(y = ..density..), alpha = 0.5, color = "white")+
  geom_density(alpha = 0.6)+
  scale_fill_brewer(palette = "Paired")+facet_wrap(~college_major)+
  labs(x = "Scores at college", y = "Density", title = "Distribution of scores for college students with different majors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally, we will analyze their MBA performance. Quite surprisingly, there are only two options for MBA specialization: Mkt&Fin and Mkt&HR. Furthermore, we see that among those who specialized in Mkt&Fin, a larger proportion of applicants get placed than among those who specialized in Mkt&HR, which suggests that companies have a greater demand for people specialized in Mkt&Fin.

ggplotly(campus%>%
  count(mba_special,status)%>%
  ggplot(aes(x = mba_special, y = n, fill = status))+
  geom_col(position = "fill")+scale_fill_brewer(palette = "Paired")+
  labs(x = "MBA specialization", y = "Proportion", title = "The proportion of applicants being placed in two MBA specialization"))

Based on the following graph, we see that in the more in-demand specialization, Mkt&Fin, the MBA scores of placed applicants are generally higher. For the other specialization, however, the placed applicants do not show any advantage in their scores. That is to say, the MBA score appears to be a determining factor for recruitment only if the applicant is specialized in Mkt&Fin; if the applicant specializes in Mkt&HR, the score may not be as important.

campus%>%
  ggplot(aes(x = mba_special, y = mba_score, fill = status))+
  geom_boxplot(alpha = 0.7)+scale_fill_brewer(palette = "Paired")+
  labs(x = "MBA specialization", y = "MBA scores", title = "The score distribution in two MBA specialization")

We can take a step further and see whether MBA scores affect the salary of placed applicants.

Based on the graph below, we see that among applicants with MBA scores between roughly 55 and 67, those with previous work experience tend to have higher salaries than those without. The situation changes when the scores are greater than 67. This naturally leads us to the next question: do people with previous work experience generally earn higher salaries?

ggplotly(campus%>%
           ggplot(aes(x = mba_score, y = salary, color = workex))+
           geom_point(alpha = 0.6)+
           geom_smooth(se = FALSE)+
           scale_color_brewer(palette = "Paired")+scale_y_log10()+
           labs(x = "MBA score", y = "Salary", title = "The relationship between MBA scores and salary"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 67 rows containing non-finite values (stat_smooth).

The answer is revealed in the following graph. We see that the typical salary is pretty much the same for both groups. The only difference is that there are more outliers in the group with work experience. That is to say, salary generally does not depend on previous work experience; however, if you do have work experience, you are more likely to get an unusually high salary.

ggplotly(campus%>%
  ggplot(aes(x = workex, y = salary, fill = workex))+geom_boxplot(alpha = 0.7)+
  scale_y_log10()+
  scale_fill_brewer(palette = "Paired")+
  labs(x = "Working experience", y = "Salary", title = "The salary distribution in regard of working experience"))
## Warning: Removed 67 rows containing non-finite values (stat_boxplot).
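
The same point can be checked numerically. Since salary is missing for unplaced applicants, we drop those rows first (a small sketch, not part of the original output):

campus %>%
  filter(!is.na(salary)) %>%           # keep placed applicants only
  group_by(workex) %>%
  summarise(median_salary = median(salary),
            iqr_salary = IQR(salary),  # spread of salaries within each group
            n = n())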

After all this analysis, we are finally ready to build our model.

3. Building the machine learning model

First we will clean our data once more.

# Drop the id column and convert character columns to factors
campus_df <- campus %>% select(-sl_no) %>% mutate_if(is.character, factor)

Next, we will split our data into training and testing sets. In order to compare the performance of different models later, we will create a bootstraps object based on our training data.

set.seed(123)
campus_split <- initial_split(campus_df)

campus_train <- training(campus_split)
campus_test <- testing(campus_split)

campus_boot <- bootstraps(campus_train)
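
Because the two classes are imbalanced, it is worth verifying that the split leaves a reasonable mix of placed and unplaced applicants in the training set (a quick check; passing strata = status to initial_split() would be another option, though it is not used here):

campus_train %>% count(status)  # class balance in the training data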

After getting the training data, we pre-process it with the help of the recipe() function from the tidymodels framework.

campus_recipe <- recipe(status ~ ., data = campus_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  # convert categorical predictors to dummy variables
  step_knnimpute(salary) %>%                      # impute missing salaries with k-nearest neighbors
  step_normalize(all_predictors()) %>%            # center and scale all predictors
  step_corr(all_predictors()) %>%                 # drop highly correlated predictors
  step_zv(all_predictors())                       # drop zero-variance predictors

prep(campus_recipe)%>%juice()
## # A tibble: 162 x 16
##    middle_school_s… high_school_sco… college_score employ_test mba_score  salary
##               <dbl>            <dbl>         <dbl>       <dbl>     <dbl>   <dbl>
##  1          -0.0872           2.37         -1.17        -1.25    -0.606  -0.171 
##  2           1.11             1.11          1.57         1.10     0.673  -0.996 
##  3          -0.281            0.0875       -0.325        0.239   -0.777  -0.407 
##  4          -1.15            -1.50         -2.01        -0.431   -0.498   0.784 
##  5           1.73             0.642         0.980        1.86    -1.17    1.66  
##  6          -1.25            -1.72          0.131       -1.25    -1.84   -0.0179
##  7           1.36            -0.309        -0.0442      -0.356   -0.0345 -0.383 
##  8           0.493            1.18          0.798        1.46    -0.180  -0.631 
##  9          -0.958            0.286        -0.746       -1.32    -1.73    0.0175
## 10          -0.958           -0.606        -0.886       -0.729   -0.255  -0.289 
## # … with 152 more rows, and 10 more variables: status <fct>, gender_M <dbl>,
## #   middle_school_board_Others <dbl>, high_school_board_Others <dbl>,
## #   high_school_special_Commerce <dbl>, high_school_special_Science <dbl>,
## #   college_major_Others <dbl>, college_major_Sci.Tech <dbl>, workex_Yes <dbl>,
## #   mba_special_Mkt.HR <dbl>
# Apply the same trained pre-processing steps to the test set
testing_proc <- bake(prep(campus_recipe), new_data = campus_test)

Now we can start to set up the models. In this project, we will build four models: a logistic regression model, a random forest model, a nearest neighbor model, and a decision tree model.

lr_model <- logistic_reg()%>%set_engine("glm")

rf_model <- rand_forest()%>%set_mode("classification")%>%set_engine("ranger")

nn_model <- nearest_neighbor()%>%set_mode("classification")%>%set_engine("kknn")

tree_model <- decision_tree()%>%set_mode("classification")%>%set_engine("rpart")

Then we create a workflow for each model, based on the recipe and the model specifications defined earlier.

lr_wf<-workflow()%>%add_recipe(campus_recipe)%>%add_model(lr_model)

rf_wf<-workflow()%>%add_recipe(campus_recipe)%>%add_model(rf_model)

nn_wf<-workflow()%>%add_recipe(campus_recipe)%>%add_model(nn_model)

tree_wf<-workflow()%>%add_recipe(campus_recipe)%>%add_model(tree_model)

Finally, we are ready to fit our models with the bootstraps object we created.

lr_result<-lr_wf%>%fit_resamples(resamples = campus_boot, control = control_resamples(save_pred = TRUE,verbose = TRUE))

rf_result<-rf_wf%>%fit_resamples(resamples = campus_boot, control = control_resamples(save_pred = TRUE,verbose = TRUE))

nn_result<-nn_wf%>%fit_resamples(resamples = campus_boot, control = control_resamples(save_pred = TRUE,verbose = TRUE))

tree_result<-tree_wf%>%fit_resamples(resamples = campus_boot, control = control_resamples(save_pred = TRUE,verbose = TRUE))

By collecting the metrics, we can evaluate our models. Based on the two metrics we have, we can see that the random forest model performs best in both accuracy and ROC AUC among the four models.

lr_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.845    25 0.00800
## 2 roc_auc  binary     0.896    25 0.00894
rf_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.856    25 0.00958
## 2 roc_auc  binary     0.908    25 0.00972
nn_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.773    25  0.0110
## 2 roc_auc  binary     0.774    25  0.0110
tree_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.793    25  0.0108
## 2 roc_auc  binary     0.772    25  0.0210
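
To make the comparison easier, we can also gather all four sets of metrics into a single table (a convenience sketch, not in the original write-up):

# Combine the resampling metrics of all four models for a side-by-side comparison
bind_rows(
  lr_result %>% collect_metrics() %>% mutate(model = "lr"),
  rf_result %>% collect_metrics() %>% mutate(model = "rf"),
  nn_result %>% collect_metrics() %>% mutate(model = "nn"),
  tree_result %>% collect_metrics() %>% mutate(model = "tree")
)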

We can also visualize the results using ROC curves:

result_total <- lr_result %>% collect_predictions() %>% mutate(model = "lr") %>%
  bind_rows(rf_result %>% collect_predictions() %>% mutate(model = "rf")) %>%
  bind_rows(nn_result %>% collect_predictions() %>% mutate(model = "nn")) %>%
  bind_rows(tree_result %>% collect_predictions() %>% mutate(model = "tree"))

result_total%>%group_by(model)%>%roc_curve(status,`.pred_Not Placed`)%>%autoplot()

As a result, we will use the random forest model for the remaining part of this project. Once we have determined the model, we fit it one last time and evaluate it on our testing data. This time we get an ROC AUC estimate of 0.961, which indicates the model is highly accurate in predicting placement status.

set.seed(1234)
rf_model %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status ~ ., data = prep(campus_recipe) %>% juice()) %>%
  predict(new_data = testing_proc, type = "prob") %>%
  mutate(truth = testing_proc$status) %>%
  roc_auc(truth, `.pred_Not Placed`)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.961
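
Besides the ROC AUC, we can also look at hard class predictions on the test set, for example through a confusion matrix. The sketch below refits the same random forest (the object name rf_fit is introduced here for illustration); the exact counts will depend on the random seed:

set.seed(1234)
rf_fit <- rf_model %>%
  set_engine("ranger") %>%
  fit(status ~ ., data = prep(campus_recipe) %>% juice())

rf_fit %>%
  predict(new_data = testing_proc) %>%              # default type = "class"
  mutate(truth = testing_proc$status) %>%
  conf_mat(truth = truth, estimate = .pred_class)   # confusion matrix on the test set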

We can take a look at the important factors in our random forest model. From the graph, we can see that the top four determining factors are middle_school_score, college_score, high_school_score, and mba_score, so academic performance is in fact really important. The reason middle_school_score has the highest importance may be that the scores of placed and unplaced applicants had already diverged quite a lot by middle school. We also notice that gender_M has a comparably high importance, which may suggest that gender bias still plays a role in the recruitment process.

set.seed(2345)
library(vip)
## Warning: package 'vip' was built under R version 4.0.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
rf_vip <- rf_model %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status ~ ., data = prep(campus_recipe) %>% juice()) %>%
  vi()

rf_vip %>%
  ggplot(aes(x = Importance, y = reorder(Variable, Importance))) +
  geom_point() +
  geom_segment(aes(x = 0, xend = Importance,
                   yend = reorder(Variable, Importance))) +
  labs(y = "Variables", title = "Variable importance for the random forest model")
