How to conquer the mountains in Himalaya?

1. Introduction

This week’s data records every attempt people have made to climb the mountains in Himalaya. As an amature hiker, I really hope one day in the future I could step my feet on the groud of Everest which is probably the most famous mountain in Himalaya.

In order to better prepare myself or other climbers (theoretically) for the future climbing, I hope to find out the factors that would have a huge influence on the success rate. After studying different factors in details, a machine learning model would be set up in the final part of this project that could predict my success rate of climbing the mountains in Himalaya.

2. Exploratory Data Analysis

To optimize the success rate, I ask myself the following questions:

  1. Which mountain(s) has the highest success rate?
  2. Which season should I make my first attempt?
  3. I should be in a group of how many members so that I could have the best success rate?
  4. Would it help me to improve my chance of getting to the peak by hiring more people?
  5. Should I use oxygen when I’m climbing?
  6. The citizens from which country have the higest success rate (so that I can hire them)?

I believe by answering these questions, the chance of success would be improve significantly, and this section would revolve around them.

2.1 Success rate of the top-climbed mountains

First, let’s find out the success rate for different mountains.

library(tidyverse)
theme_set(theme_light())
members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')
expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv')
peaks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/peaks.csv')

After importing the data, let’s first take a look at how many times have the mountains at Himalaya been climbed. Based on this graph, we see that the Everest, Ama Dablam, Cho Oyu, Manaslu, and Lhotse have the highest number of climbing attempts. That is to say the climbing routes of these mountains would be more mature than the others.

expeditions%>%
  mutate(peak_name = fct_lump_n(peak_name,9))%>%
  filter(!is.na(peak_name))%>%add_count(peak_name)%>%
  ggplot(aes(x = reorder(peak_name,n)))+geom_bar(fill = "lightskyblue3")+
  geom_text(aes(label = n, y = n), hjust = 1.3)+
  coord_flip()+
  labs(y = "Count", x = "",
       title = "The number of attempts of climbing for each mountains")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue")
  )

Then, we analyze how the number of successful and failing attempts changes over time. From the graph, the general trend is that as time goes on, more and more attempts have been made, so both successful and failing attempts have increased. However, notice the dramatic drop of successful attempts in 2015. After searching related news online, this drop is due to the fact that an earthquake happened near Everest and caused an avalanche. People were afraid of another earthquake, so they restricted their climbing. Details could be found here.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
members%>%
  ggplot(aes(x = as.integer(year),fill = success))+
  geom_histogram(bins = 100)+
  theme(plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue")
  )+
  scale_fill_brewer(palette = "Paired")+
  labs(x = "Year", y = "Count",
       title = "How does the number of successful and failing attempts change?",
       subtitle = "After searching infomation online,
the plunging number of attempts at 2015 is due to
a huge earthquake happened around Everest")

The graph above only shows us the big picture, so to secure our chance of making to the top, we need to focus on analyzing specific mountains. Here, in the following graph, we visualize how the number of attempts changes for the mostly climbed mountains.s

library(plotly)
members%>%
  mutate(peak_name = fct_lump_n(peak_name,5))%>%
  filter(!is.na(peak_name))%>%
  ggplot(aes(x = year,fill = success))+
  geom_histogram(bins = 50)+
  labs(x = "Year", y = "Count",
       title="How the climbing attemps changes for the top-climbed  mountains?")+
  theme(
    
    plot.title = element_text(margin = margin(0,0,10,0),face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  facet_wrap(~peak_name,scales = "free")+
  scale_fill_brewer(palette = "Paired")

However, one should realize that usually the increasing number of successful attempts could be explained by the fact that more people are climbing the mountains. Thus, to leverage out this impact, we calculate the success rate.

After calculating the success rate for these mountains, we find out that even with short-term fluctuations, the success rates of only Cho Oyu, Manaslu and Everest seem to increase over time. The positive correlation is the strongest for Everest, so no wonder Everest is the mostly climbed mountains.

library(gganimate)
## Warning: package 'gganimate' was built under R version 4.0.2
p<-members%>%
  mutate(peak_name = fct_lump_n(peak_name,5))%>%
  filter(!is.na(peak_name))%>%
  group_by(peak_name,year)%>%
  summarise(success_rate = mean(success==TRUE))%>%
  ggplot(aes(x = year, y = success_rate, group = 1))+
  geom_point(color = "lightskyblue3")+
  geom_line(color = "lightskyblue3")+
  labs(x = "Year", y = "Success Rate",
       title = "How does the success rate change over time?",
       subtitle = "Only Cho Oyu, Manaslu and Everest show \n positive correlation between success rate and year.")+
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(size = 11,lineheight = 0.9),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  facet_wrap(~peak_name,scales = "free")+
  transition_reveal(year)
## `summarise()` regrouping output by 'peak_name' (override with `.groups` argument)
animate(p, end_pause = 40)

2.2 Season and success rate

This section would focus on this question: which season should one choose to climb the mountains? Intuitively, we would think that one should avoid climbing the mountains in winter. The weather is already freezing enough on the mountains, climbing in winter would only make things worse. But is our intuition true? The following analysis would tell us.

In fact, as the following diagram suggests, you should definitely climb different mountains in specific seasons. Take Everest for example, we see that most of the successful attempts happen in spring. That is to say, if you actually go to Evereset in other seasons, it is highly possible for you not to get to the peak successfully. If you want to climb Ama Dablam, then the favorable seasons would be autumn and winter. With the help of this graph, one should realize the importance of picking the right time to make their attempts.

expeditions%>%mutate(peak_name = fct_lump_n(peak_name,5))%>%
  filter(peak_name!="NA")%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"))%>%
  filter(season!="Unknown")%>%
  ggplot(aes( x = status, fill = status))+
  geom_bar()+
  labs(x = "", y = "Count",
       title = "The number of attempts for mostly-climbed mountains in different seasons")+
  theme(
    plot.title = element_text(face = "bold",size = 13),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  scale_fill_brewer(palette = "Paired")+
  facet_grid(season~peak_name,scales = "free")

2.3 Number of members and success rate

After figuring out which season to go for a climb, the next thing to consider is your group members: how many people should you climb together with?

The graph below shows the distribution of number of group members for the mostly-climbed mountains. As the graph illustrates most of the teams are made up of 6-10 people.

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
expeditions%>%select(expedition_id,peak_name,members,member_deaths)%>%
  inner_join(members%>%select(expedition_id,peak_name,success))%>%
  filter(members>0)%>%
  mutate(peak_name = fct_lump_n(peak_name,5))%>%
  filter(!is.na(peak_name))%>%
  group_by(peak_name,members)%>%count(success)%>%
  mutate(success = case_when(success == "TRUE"~"True",
                             success == "FALSE"~"False"))%>%
  mutate(members = as.factor(case_when(between(members,1,5)~"1-5",
                             between(members,6,10)~"6-10",
                             between(members,11,15)~"11-15",
                             between(members,16,20)~"16-20",
                             between(members,21,25)~"21-25",
                             between(members,26,30)~"26-30",
                             between(members,31,35)~"31-35",
                             between(members,36,40)~"36-40",
                             TRUE~"41 and above")))%>%
  group_by(peak_name,members,success)%>%summarise(n = sum(n))%>%
  ggplot(aes(x = members, y = n, fill = success))+
  geom_histogram(stat = "identity")+
  labs(x = "Number of members", y = "Count",
       title = "Distribution of number of members for the mostly climbed mountains",
       subtitle = "It turns out 6-10 people in a group is the most popular choice.")+
  facet_wrap(~peak_name,scales = "free")+
  theme(axis.text.x = element_text(angle = 90))+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  scale_fill_brewer(palette = "Paired")
## Joining, by = c("expedition_id", "peak_name")
## `summarise()` regrouping output by 'peak_name', 'members' (override with `.groups` argument)
## Warning: Ignoring unknown parameters: binwidth, bins, pad

However, this tells us nothing about the optimal number of members a team should have for each mountains, so we turn to calculate the success rate.

When you first lay your eyes upon this graph, you may quickly notice the spike of success rate for team of 31-35 people and think to yourselves “Wow, I should really get into a group of as many people as possible”. Nontheless, if you go back to the last graph, you will realize that the spike is only due to the fact that there are so few groups of more than 35 members.

In particular, if you want to climb Everest, then a team of 1-5 people should help you to improve your success rate.

expeditions%>%select(expedition_id,peak_name,members,member_deaths)%>%
  inner_join(members%>%select(expedition_id,peak_name,success))%>%
  filter(members>0)%>%
  mutate(peak_name = fct_lump_n(peak_name,5))%>%
  filter(!is.na(peak_name))%>%
  group_by(peak_name,members)%>%count(success)%>%
  mutate(success = case_when(success == "TRUE"~"True",
                             success == "FALSE"~"False"))%>%
  mutate(members = as.factor(case_when(between(members,1,5)~"1-5",
                             between(members,6,10)~"6-10",
                             between(members,11,15)~"11-15",
                             between(members,16,20)~"16-20",
                             between(members,21,25)~"21-25",
                             between(members,26,30)~"26-30",
                             between(members,31,35)~"31-35",
                             between(members,36,40)~"36-40",
                             TRUE~"41 and above")))%>%
  group_by(peak_name,members,success)%>%summarise(n = sum(n))%>%
  pivot_wider(names_from = success, values_from = n)%>%
  filter(True>0)%>%
  mutate(success_rate = True/(True+False))%>%
  ggplot(aes(x = members, y = success_rate))+
  geom_bar(stat = "identity",fill = "skyblue3" )+
  labs(x = "Number of members", y = "Success rate",
       title = "Success rate for teams with different number of members")+
  facet_wrap(~peak_name,scales = "free")+
  theme(axis.text.x = element_text(angle = 90))+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  scale_fill_brewer(palette = "Paired")
## Joining, by = c("expedition_id", "peak_name")
## `summarise()` regrouping output by 'peak_name', 'members' (override with `.groups` argument)

To digress a little bit, let’s examine the relationship between member numbers and the time it takes to climb to the top. I’ve been wondering is it true that the more people I have in my group, the faster we will get to the peak. So I draw the graph below to answer my question. In fact, more people in your group would actually take you more time to reach the peak, suggested by the regression line of a positive gradient.

expeditions%>%
  filter(peak_name=="Everest")%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"))%>%
  mutate(total_member = members+hired_staff,
         time = highpoint_date-basecamp_date)%>%
  filter(status=="Success")%>%
  ggplot(aes(y = time, x = total_member))+
  geom_point( alpha = 0.3)+
  scale_x_log10()+
  geom_smooth()+
  labs(x = "Number of members in a group",
       y = "Time it takes to get to the peak",
       title = "The relationship between member numbers and time it takes to get to the top")+
  theme(
    plot.title = element_text(face = "bold",size = 12))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

2.4 Ratio of hired members and success rate

In this section, we will answer the question “would a higher ratio of hired members in a team increase the success rate?” To answer it, what we need to do is to plot the success rate of the popular mountains against the average ratio of hired members.

To make the result meaningful, we need to first filter out the mountains that are barely climbed.

common_mounts<-expeditions%>%
  count(peak_name)%>%filter(n>15)

common_mounts
## # A tibble: 48 x 2
##    peak_name           n
##    <chr>           <int>
##  1 Ama Dablam       1366
##  2 Annapurna I       243
##  3 Annapurna II       35
##  4 Annapurna III      36
##  5 Annapurna IV       97
##  6 Annapurna South    36
##  7 Api Main           18
##  8 Baruntse          308
##  9 Bhrikuti           20
## 10 Chamlang           24
## # … with 38 more rows

After filtering out some mountains, here is what we have. With the addition of a regression line, we see that the relationship between success rate and hired ratio is actually in U shape. That is to say, only a relatively high and low hired ratio would contribute to higher success rate. In the case of climbing Everest, based on this graph, maybe you should chose a higher hired ratio to boost your success rate.

expeditions%>%filter(peak_name%in%common_mounts$peak_name)%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"))%>%
  mutate(total_member = members+hired_staff,
         hired_ratio = hired_staff/total_member)%>%
  group_by(peak_name)%>%summarise(success_rate = mean(status=="Success",na.rm = TRUE),
                                  hired_ratio = mean(hired_ratio,na.rm = TRUE))%>%
  ggplot(aes(x = hired_ratio, y = success_rate))+
  geom_point()+
  geom_text(aes(label = peak_name),check_overlap = TRUE,vjust = -1)+
  geom_smooth()+
  labs(x = "Hired Ratio", y = "Success rate",
       title = "The relationship between hired ratio and success rate")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))+
  scale_fill_brewer(palette = "Paired")+
  lims(x = c(-0.04,0.43))
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.5 Oxygen usage and success rate

Last year, I travelled to Tibet in the end of my summer vacation. The average altitude there is about 3000-4000 meters, so my first two days were rather miserable due to the altitude sickness. Whenever I felt uncomfortable, I would inhale some pure oxygen sealed in a bottle. Trust me, oxygen could really save your life at certain point, so I do understand how important oxygen usage could be during the process of climbing the mountains in Himalay. Thus, I envision the success rate of people who used oxygen woudl be higher than those who didn’t, and as the graph below shows, my guess does make sense.

ggplotly(expeditions%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"))%>%
  filter(peak_name%in%common_mounts$peak_name)%>%
  group_by(year,oxygen_used)%>%summarise(success_rate = mean(status == "Success"))%>%
  ggplot(aes(x = year, y = success_rate, color = oxygen_used))+
  geom_point()+geom_line()+
  labs(x = "Year", y = "Success rate",
       title = "The annual success rate in regards of oxygen usage",
       color = "Oxygen Used")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"),
    legend.key =  element_rect(fill ="aliceblue"),
    legend.text = element_text(size = 10))+
  scale_color_brewer(palette = "Paired"))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)

The first recorded usage of oxygen happened around 1925, but due to the lag behind of other climbing techinque and techology, there was hardly any difference between the success rate of people using and not using oxygen.

Gradually, as time goes on, the performance of these two groups begins to diverge. After 1986, the climbers who used oxygen have higher success rate than those who did not. Notice that the success rate of people using oxygen in 2019 is 0.90, which means 90% of people who successfully got to the top used oxygen when they were climbing.

Then, we turn to specific mountains to see if the pattern would differ from mountain to mountian. Based on the graph below, we see that for most of the mountains, the teams who have used oxygen tend to have a considerably higher success rate. In particular, for Everest, the difference of success rate between these two groups is so dramatic that I strongly recommend thos who want to climb Everest to use oxygen. However, for Ama Dablam, the pattern changes: the teams that used oxygen actually have a slightly lower success rate than those who didn’t.

expeditions%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"),
         peak_name = fct_lump(peak_name,5))%>%
  filter(peak_name!="NA")%>%
  group_by(peak_name,oxygen_used)%>%
  summarise(success_rate = mean(status == "Success"))%>%
  ggplot(aes(x = oxygen_used, y = success_rate, fill = oxygen_used))+
  geom_col()+
  labs(x = "Oxygen Used", y = "Success rate",
       title = "Relationship between oxygen usage and success rate")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.position = "null")+
  scale_fill_brewer(palette = "Paired")+
  facet_wrap(~peak_name)
## `summarise()` regrouping output by 'peak_name' (override with `.groups` argument)

2.6 Citizenship and success rate

Next, we study the relationship between success rate and citizenship. First, let’s take a glimpse at the general picture of each country’s performance. In the following graph, we select and rank the top 10 countries which conquered the most number of mountains that hadn’t been climbed before. Based on it, we see that Nepal, Japan, and UK are the top three countries that conquered the most virgin peaks.

We have also annotated the top 10 mountains with the most climbing attempts in the graph. Notice that strangely, all of these mountains were first climbed around 1950.

library(ggrepel)
top_countries<-peaks%>%filter(!is.na(first_ascent_country), first_ascent_year>1000)%>%
  separate_rows(first_ascent_country,sep = ", ")%>%
  count(first_ascent_country,sort = TRUE)%>%head(10)

top_mounts <- common_mounts%>%top_n(10)
## Selecting by n
peaks%>%filter(!is.na(first_ascent_country), 
               first_ascent_year>1000)%>%
  separate_rows(first_ascent_country,sep = ", ")%>%
  add_count(first_ascent_country,sort = TRUE)%>%
  filter(first_ascent_country%in%top_countries$first_ascent_country)%>%
  mutate(first_ascent_country = reorder(first_ascent_country,n),
         peak_name = case_when(peak_name%in%top_mounts$peak_name~peak_name,
                               TRUE~ ""))%>%
  ggplot(aes(x = first_ascent_year, y = first_ascent_country))+
  geom_point(size = 3,alpha = 0.3)+
  geom_text_repel(aes(label = peak_name), force = 5)+
  labs(x= "First ascent year", y = "",
       title = "Top 10 countries that conquered the most virgin peaks")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"),
    legend.key =  element_rect(fill ="aliceblue"))

Next, we study the success rate of members from different countries. From the graph below, we see that Nepal, China, and India are the countries with the highest success rate. In this way, you should probably join a group with members of these citizenship.

members_clean<-members%>%select(sex, age, citizenship, expedition_role,success)

top_citizens<-members_clean%>%count(citizenship,sort = TRUE)%>%head(15)

members_clean%>%
  filter(citizenship%in%top_citizens$citizenship)%>%
  group_by(citizenship)%>%
  summarise(success_rate = mean(success == "TRUE"))%>%
  ggplot(aes(x = success_rate, y = reorder(citizenship,success_rate)))+
  geom_point(size = 4, color = "skyblue3")+
  geom_segment(aes(x = 0, xend = success_rate, y = citizenship, yend = citizenship),
               size = 2.3,
               alpha = 0.5, color = "skyblue3" )+
  labs(x = "Success rate", y ="",
       title = "Top 10 countries with the highest success rate")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))
## `summarise()` ungrouping output (override with `.groups` argument)

2.7 Other factors that affect success rate

To make the analysis more comprehensive, in this section, we will examine other factors that might affect the success rate.

To begin with, we focus on the age and gender. Based on the graph, we see that as people ages, the success rate gradually decreases. That is to say, experience in life would not really help you to climb mcuh better.

In addition, from the age of 20 to about 50, male climbers tend to have a higher success rate than female climbers. After the age of 50, female climbers start to take the lead in the success rate. One should notice that at the age of 63, there is a huge difference of the success rate between male and female climbers.

members_clean%>%
  filter(!is.na(sex),
         !is.na(age))%>%
  add_count(sex,age,sort = TRUE)%>%
  filter(n >10)%>%
  group_by(sex,age)%>%
  summarise(success_rate = mean(success == "TRUE"))%>%
  ggplot(aes(x = age, y = success_rate, color = sex))+
  annotate("text", x = 20,y = 0.68, label = "(16,0.68)")+
  annotate("text", x = 17,y = 0.46, label = "(16,0.5)")+
  annotate("text", x = 67,y = 0.52, label = "(63,0.52)")+
  annotate("text", x = 67,y = 0.26, label = "(63,0.222)")+
  annotate("point", x = 16,y = 0.68, colour = "orange", size = 3)+
  annotate("point", x = 16,y = 0.5, colour = "orange", size = 3)+
  annotate("point", x = 63,y = 0.52, colour = "orange", size = 3)+
  annotate("point", x = 63,y = 0.222, colour = "orange", size = 3)+
  geom_point()+geom_line()+
  labs(x = "Age", y = "Success rate", fill = "Gender",
       title = "The relationship between success rate and age, together with gender")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"),
    legend.key = element_rect(fill ="aliceblue"))+
  scale_color_brewer(palette = "Paired")
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

Then we analyze the success rate of the positions in the team. As it turns out, the H-A assistant, H-A worker, and climbing leader have the highest success rate. The reason could be that their important positions indirectly reflect their climbing ability. As a result, they tend to have a higher success rate.

members_clean%>%
  filter(!is.na(expedition_role))%>%
  mutate(expedition_role = fct_lump_n(expedition_role,9))%>%
  group_by(expedition_role)%>%
  summarise(success_rate = mean(success == "TRUE"))%>%
  ggplot(aes(x = success_rate, y = reorder(expedition_role,success_rate)))+
  geom_point(size = 4, color = "skyblue3")+
  geom_segment(aes(x = 0, xend = success_rate, y = expedition_role, yend = expedition_role),
               size = 2.3,
               alpha = 0.5, color = "skyblue3")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"))
## `summarise()` ungrouping output (override with `.groups` argument)

3. Death and injury

The topic in this section is a little heavy, but we have to study it so that we could come back alive after the expeditions. The main focus is to find out what is the most deadly altitude of the mountains.

After studying the dataset closely, I plot the following graph. It filters out all the successful climbings and those without death. From this graph, we see that the majority of death happens 25 days after the teams left the basecamp. In addition, most of the death occurs at the altitude over 8000 meters, which is to say that most of people died when they were going to reach the top.

expeditions%>%
  filter(peak_name=="Everest")%>%
  mutate(status = case_when(str_detect(termination_reason, "Success")~"Success",
                                        TRUE~"Failed"))%>%
  mutate(total_death = member_deaths+hired_staff_deaths,
         time = highpoint_date-basecamp_date)%>%
  filter(status=="Failed", total_death!=0)%>%
  ggplot(aes(x = time, y = highpoint_metres))+
  geom_point(aes(size = total_death), alpha = 0.4)+
  labs(x = "Time spent after teams left basecamp",
       y="Altitude", 
       title = "When and where does the death occurs?")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue"),
    legend.background = element_rect(fill ="aliceblue"),
    legend.key = element_rect(fill ="aliceblue") )
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 9 rows containing missing values (geom_point).

We then switch to see the injury situation, and once again, we reach a similar conclusion. Based on the plot below, most of the injury occurs at the altitude almost reaching the peak.

members%>%filter(peak_name=="Everest")%>%
  ggplot(aes(x= injury_height_metres))+
  geom_histogram(fill = "lightskyblue3", color = "white")+
  labs(x = "Altitude when injury occurs",
       y = "Count",
       title = "Where does injury happens?")+
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill =  "aliceblue"),
    plot.background = element_rect(fill ="aliceblue")
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 21511 rows containing non-finite values (stat_bin).

These two graphs really give us a lot of insights on how to successfully conquer the mountains: we should never relax a single bit only until we step our feet on the ground of the peaks. My speculation for the death and injury around the peak is that a lot of people suddenly loosened up the extreme stress they had been exerting on themseleves when they realized success is almost at hand. As a result of the changing mentality, they neglected certain dangerous elements that they would have noticed.

4. Build the machine learning model

library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.0.2
## ── Attaching packages ───────────────────────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.0      ✓ recipes   0.1.13
## ✓ dials     0.0.8      ✓ rsample   0.0.7 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.0.2      ✓ workflows 0.1.3 
## ✓ parsnip   0.1.3      ✓ yardstick 0.0.7
## Warning: package 'broom' was built under R version 4.0.2
## Warning: package 'dials' was built under R version 4.0.2
## Warning: package 'infer' was built under R version 4.0.2
## Warning: package 'modeldata' was built under R version 4.0.2
## Warning: package 'parsnip' was built under R version 4.0.2
## Warning: package 'rsample' was built under R version 4.0.2
## Warning: package 'tune' was built under R version 4.0.2
## Warning: package 'yardstick' was built under R version 4.0.2
## ── Conflicts ──────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x plotly::filter()  masks dplyr::filter(), stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
members_selected<-members%>%
  filter(season!="Unknown",
         !is.na(sex), 
         !is.na(citizenship),
         !is.na(age),
         !is.na(expedition_role),
         !is.na(oxygen_used),
         !is.na(peak_name),
         !is.na(age))%>%
  select(-member_id,-expedition_id,-peak_id,-died,-death_cause,
         -death_height_metres,-injured,-injury_type,-hired,-solo,
         -injury_height_metres,-highpoint_metres)%>%
  mutate(success = case_when(success == "TRUE"~"True",
                             success == "FALSE"~"False"))%>%
  mutate_if(is.character,factor)%>%
  mutate_if(is.logical,as.integer)

set.seed(1234)
member_split<-initial_split(members_selected,strata = success)

member_train <- training(member_split)
member_test <- testing(member_split)

member_fold <- vfold_cv(member_train, strata = success)
member_recipe<-recipe(success~., data = member_train)%>%
  step_other(peak_name,expedition_role,citizenship)%>%
  step_dummy(all_nominal(),-success)%>%
  step_normalize(all_predictors())%>%
  step_zv(all_predictors())

prep(member_recipe)%>%juice()
## # A tibble: 54,740 x 20
##     year    age oxygen_used success peak_name_Cho.O… peak_name_Evere…
##    <dbl>  <dbl>       <dbl> <fct>              <dbl>            <dbl>
##  1 -1.63  0.256      -0.567 False             -0.364           -0.637
##  2 -1.63  0.352      -0.567 False             -0.364           -0.637
##  3 -1.63 -0.995      -0.567 False             -0.364           -0.637
##  4 -1.63  0.256      -0.567 False             -0.364           -0.637
##  5 -1.63 -0.321      -0.567 False             -0.364           -0.637
##  6 -1.63 -1.19       -0.567 False             -0.364           -0.637
##  7 -1.63 -0.802      -0.567 False             -0.364           -0.637
##  8 -1.55 -0.225      -0.567 False             -0.364           -0.637
##  9 -1.55  0.641      -0.567 False             -0.364           -0.637
## 10 -1.55 -1.19       -0.567 False             -0.364           -0.637
## # … with 54,730 more rows, and 14 more variables: peak_name_Manaslu <dbl>,
## #   peak_name_other <dbl>, season_Spring <dbl>, season_Summer <dbl>,
## #   season_Winter <dbl>, sex_M <dbl>, citizenship_Japan <dbl>,
## #   citizenship_Nepal <dbl>, citizenship_UK <dbl>, citizenship_USA <dbl>,
## #   citizenship_other <dbl>, expedition_role_H.A.Worker <dbl>,
## #   expedition_role_Leader <dbl>, expedition_role_other <dbl>
lr_model <- logistic_reg()%>%set_engine("glm")

rf_model <- rand_forest()%>%set_mode("classification")%>%set_engine("ranger",importance = "permutation")
lr_wf<-workflow()%>%add_recipe(member_recipe)%>%add_model(lr_model)

rf_wf<-workflow()%>%add_recipe(member_recipe)%>%add_model(rf_model)
set.seed(125)
lr_result<-lr_wf%>%fit_resamples(resamples = member_fold,
                      control = control_resamples(save_pred = TRUE,verbose = TRUE))

rf_result<-rf_wf%>%fit_resamples(resamples = member_fold,
                      control = control_resamples(save_pred = TRUE, verbose = TRUE))
lr_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.769    10 0.00174
## 2 roc_auc  binary     0.821    10 0.00183
rf_result%>%collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.785    10 0.00161
## 2 roc_auc  binary     0.848    10 0.00178
library(vip)
## Warning: package 'vip' was built under R version 4.0.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
rf_model%>%
  fit(
    success~.,
    prep(member_recipe)%>%juice())%>%
  vip(geom = "point")

Xuxin Zhang
Xuxin Zhang

Just a wondering village boy.

Related