Introduction

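I'm Kim Kwang-tae (김광태). I work in the HR office of H Group, where I have handled a range of HR functions
and analyze HR data such as personnel records, organizational culture surveys, and collaboration networks.

Aspiring to become a People Analytics expert

The courses and materials I studied from 2019 through early 2020 to build my analytics skills, along with HR-related analysis techniques,
are collected at the links below.

Today I Learned | R Tips for HR Analytics Practitioners
Data analytics changes by the day and new techniques keep arriving, so I study continuously.
I'm glad to trade analysis know-how, and questions about anything I've posted are always welcome.
Email me at yuaye.kt@gmail.com and I'll reply :)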

Topic: Attrition/Turnover

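1. Why attrition/turnover?

Why this topic

Attrition has always been a topic of interest in HR.
Over the past several years, open innovation has blurred the boundaries of talent pools across industries and companies,
so talent attraction for top performers and attrition/turnover management for key personnel
have grown in importance, particularly in fields such as software and biotech.¹
Attrition modeling, which predicts who will leave, has been studied continuously, mainly in the United States and other
advanced economies where talent pools were already shared across industries. The chart below shows that
research on attrition/turnover in the management field has kept increasing.
Many companies already analyze attrition/turnover and build predictive models,
which they use for employee retention, talent attraction, and similar purposes.²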

# Number of management papers per year that include the Attrition/Turnover keyword
data.frame(
  x = 2000:2021,
  y = c(1, 1, 14, 10, 17, 20, 22, 31, 27, 36, 34, 56, 47, 50, 59, 54, 72, 97, 85, 112, 115, 75)
) %>%
  ggplot(aes(x, y, label = paste0(y, " papers"))) +
  geom_line() +
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(
    axis.line = element_line(size = 1),
    axis.ticks = element_line(size = 1),
    panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  geom_point() +
  ggrepel::geom_text_repel(size = 3) +
  labs(
    x = "publication year",
    y = "Number of publications in the field of management",
    caption = "Note. Based on Web of Science; number of papers containing the Attrition/Turnover keyword"
  )

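Many open-source examples exist for Python and R,
but most of them focus only on the analysis technique itself, such as hyperparameter tuning,
so I added theoretical background and my own HR work experience to the modeling.
This page summarizes a basic analysis built on three models.
Material that cannot be presented in one pass, such as modeling and preprocessing driven by HR experience, is not included.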

2. Literature Review

Definition of Attrition Modeling

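Speer (2021)³ defines attrition modeling as combining variables that predict turnover into statistical algorithms
that estimate the probability of turnover within a given timeframe or at a specific timepoint.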

Attrition models combine variables that predict turnover into statistical algorithms that then estimate the probability of employee turnover within a given timeframe, or at a specific timepoint;

Purpose of Attrition Modeling

Drawing on several prior studies, Speer (2021) presents four purposes of attrition modeling, summarized as follows.
1. pre-employment selection⁴
2. validate and develop training initiatives⁵
3. facilitate workforce planning discussions with specific parts of the company⁶
4. create ad hoc programs to reduce attrition⁷
Attrition modeling thus offers a range of implications for strategic HR: hiring decisions, initiatives for workforce adjustment
and development planning, and the design of interventions to reduce employee attrition.

Purpose of Attrition Modeling: The formed attrition estimates can then serve a number of purposes, including use for pre-employment selection (Gibson et al., 2019; Strickland, 2005), to validate and develop training initiatives (McCloy et al., 2016; Strickland, 2005), to facilitate workforce planning discussions with specific parts of the company (Speer et al., 2019), to create ad hoc programs to reduce attrition (Strickland, 2005) and a variety of other HR purposes generally aimed at understanding and impacting employee turnover. The work is conducted both internally and by external vendors as well. For example, HR software companies currently offer features that include projected group-level turnover estimates within HR dashboards, as well as risk projections for individual employees. These are often accompanied by in-depth studies into the root causes of turnover, which then facilitate turnover interventions. Thus, attrition models serve various strategic HR purposes.

Preprocessing for Attrition Modeling

1. Process

Analysis approach and process

The attrition model is built on the IBM HR Analytics dataset available on Kaggle.
I followed the tidyverse ecosystem and tried to keep the code as tidy as possible,
and I tried to show the flow I have followed in my People Analytics work.
Modeling proceeded according to the process below.

No	Process	R Packages
1	Literature Review
2	Data Import	tidyverse
3	Tidy data + Transformation, Pre-Processing	tidyverse
4	Visualization for EDA, Feature Engineering	dlookr, ExPanDaR, tidyverse
5	Modeling(1) Logistic Regression	tidymodels
6	Modeling(2) RandomForest	tidymodels, randomForest
7	Modeling(3) AutoML	h2o
8	Reporting	Bookdown

This markdown page is meant to share the basic analysis flow rather than to interpret the results.

2. Data Import

2-1. Load the libraries used in the analysis.

## [1] "/Users/raymondkim/Rproject/Turnover"

2-2. Import the data.

# Import as a tibble via read_csv.
Dataset <- read_csv("archive/Data.csv")
## 
## ─ Column specification ────────────────────────────
## cols(
##   .default = col_double(),
##   Attrition = col_character(),
##   BusinessTravel = col_character(),
##   Department = col_character(),
##   EducationField = col_character(),
##   Gender = col_character(),
##   JobRole = col_character(),
##   MaritalStatus = col_character(),
##   Over18 = col_character(),
##   OverTime = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

# Check that the data was imported correctly.
Dataset %>% glimpse
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

3. Preprocessing

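3-1. Check which variables the dataset contains.

3-2. Remove single-valued variables.

A variable that holds a single value carries no information as a predictor, so it is removed.


# Check which variables the dataset contains.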
Dataset %>% diagnose() %>% arrange(unique_count)
## # A tibble: 35 x 6
##    variables      types   missing_count missing_percent unique_count unique_rate
##    <chr>          <chr>           <int>           <dbl>        <int>       <dbl>
##  1 EmployeeCount  numeric             0               0            1    0.000680
##  2 Over18         charac…             0               0            1    0.000680
##  3 StandardHours  numeric             0               0            1    0.000680
##  4 Attrition      charac…             0               0            2    0.00136 
##  5 Gender         charac…             0               0            2    0.00136 
##  6 OverTime       charac…             0               0            2    0.00136 
##  7 PerformanceRa… numeric             0               0            2    0.00136 
##  8 BusinessTravel charac…             0               0            3    0.00204 
##  9 Department     charac…             0               0            3    0.00204 
## 10 MaritalStatus  charac…             0               0            3    0.00204 
## # … with 25 more rows

  # Remove the variables with unique_count = 1 (Over18, EmployeeCount, StandardHours)
  # and the uninformative identifier (EmployeeNumber)
Dataset %>% dplyr::select(-Over18, -EmployeeCount, -StandardHours, -EmployeeNumber)->Dataset
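
The single-valued columns can also be detected programmatically instead of being read off the diagnosis table. The following is a minimal base R sketch on a made-up data frame; `toy` and its values are illustrative assumptions, not the IBM data.

```r
# Toy data frame (hypothetical values, not the IBM dataset)
toy <- data.frame(
  Age           = c(41, 49, 37, 33),
  Over18        = c("Y", "Y", "Y", "Y"),
  StandardHours = c(80, 80, 80, 80),
  OverTime      = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds exactly one distinct value
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))

constant_cols <- names(toy)[is_constant]
toy_clean <- toy[, !is_constant, drop = FALSE]

print(constant_cols)
print(names(toy_clean))
```

An identifier such as EmployeeNumber would not be caught this way because every value is distinct, so it still has to be dropped by name.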

3-3. Fix data types and anything else that needs handling.

Inspecting the variables with the diagnose functions shows that some categorical variables
are stored as numeric.
First, convert those variables to categorical.


Dataset %>% diagnose_category()
## # A tibble: 30 x 6
##    variables      levels                     N  freq ratio  rank
##    <chr>          <chr>                  <int> <int> <dbl> <int>
##  1 Attrition      No                      1470  1233 83.9      1
##  2 Attrition      Yes                     1470   237 16.1      2
##  3 BusinessTravel Travel_Rarely           1470  1043 71.0      1
##  4 BusinessTravel Travel_Frequently       1470   277 18.8      2
##  5 BusinessTravel Non-Travel              1470   150 10.2      3
##  6 Department     Research & Development  1470   961 65.4      1
##  7 Department     Sales                   1470   446 30.3      2
##  8 Department     Human Resources         1470    63  4.29     3
##  9 EducationField Life Sciences           1470   606 41.2      1
## 10 EducationField Medical                 1470   464 31.6      2
## # … with 20 more rows
Dataset %>% diagnose_numeric()
## # A tibble: 23 x 10
##    variables            min    Q1   mean median     Q3   max  zero minus outlier
##    <chr>              <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <int> <int>   <int>
##  1 Age                   18    30 3.69e1     36   43      60     0     0       0
##  2 DailyRate            102   465 8.02e2    802 1157    1499     0     0       0
##  3 DistanceFromHome       1     2 9.19e0      7   14      29     0     0       0
##  4 Education              1     2 2.91e0      3    4       5     0     0       0
##  5 EnvironmentSatisf…     1     2 2.72e0      3    4       4     0     0       0
##  6 HourlyRate            30    48 6.59e1     66   83.8   100     0     0       0
##  7 JobInvolvement         1     2 2.73e0      3    3       4     0     0       0
##  8 JobLevel               1     1 2.06e0      2    3       5     0     0       0
##  9 JobSatisfaction        1     2 2.73e0      3    4       4     0     0       0
## 10 MonthlyIncome       1009  2911 6.50e3   4919 8379   19999     0     0     114
## # … with 13 more rows


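# Education, PerformanceRating, RelationshipSatisfaction, WorkLifeBalance, JobLevel,
# StockOptionLevel, and NumCompaniesWorked appear to be categorical variables.
# They need labels so that their meaning is easy to read in later analysis.
# (The blocks below recode the seven rating-scale variables; JobLevel, StockOptionLevel,
# and NumCompaniesWorked stay numeric, as the later diagnose_numeric() output shows.)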

#1) Education
gsub(1, 'below College',Dataset$Education) -> Dataset$Education
gsub(2, 'College',Dataset$Education) -> Dataset$Education
gsub(3, 'Bachelor',Dataset$Education) -> Dataset$Education
gsub(4, 'Master',Dataset$Education) -> Dataset$Education
gsub(5, 'Doctor',Dataset$Education) -> Dataset$Education
Dataset$Education %>% as.factor %>% unique
## [1] College       below College Master        Bachelor      Doctor       
## Levels: Bachelor below College College Doctor Master

#2) Performance Rating
gsub(1, 'Low',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(2, 'Good',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(3, 'Excellent',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(4, 'Outstanding',Dataset$PerformanceRating) -> Dataset$PerformanceRating
Dataset$PerformanceRating %>% as.factor %>% unique
## [1] Excellent   Outstanding
## Levels: Excellent Outstanding

#3) WorklifeBalance

gsub(1, 'Bad',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(2, 'Good',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(3, 'Better',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(4, 'Best',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
Dataset$WorkLifeBalance %>% as.factor %>% unique
## [1] Bad    Better Good   Best  
## Levels: Bad Best Better Good

#4) JobInvolvement

gsub(1, 'Low',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(2, 'Medium',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(3, 'High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(4, 'Very High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
Dataset$JobInvolvement %>% as.factor %>% unique
## [1] High      Medium    Very High Low      
## Levels: High Low Medium Very High


#5) EnvironmentSatisfaction

gsub(1, 'Low',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(2, 'Medium',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(3, 'High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(4, 'Very High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
Dataset$EnvironmentSatisfaction %>% as.factor %>% unique
## [1] Medium    High      Very High Low      
## Levels: High Low Medium Very High


#6) JobSatisfaction

gsub(1, 'Low',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(2, 'Medium',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(3, 'High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(4, 'Very High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
Dataset$JobSatisfaction %>% as.factor %>% unique
## [1] Very High Medium    High      Low      
## Levels: High Low Medium Very High


#7) RelationshipSatisfaction

gsub(1, 'Low',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(2, 'Medium',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(3, 'High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(4, 'Very High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
Dataset$RelationshipSatisfaction %>% as.factor %>% unique
## [1] Low       Very High Medium    High     
## Levels: High Low Medium Very High
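
The outputs above show that the labels end up in alphabetical order once they become factors (for example, `Bachelor` before `below College`). If the original order of the scale matters, one alternative is to map the integer codes to labels in a single `factor()` call. This is a sketch under that assumption; `recode_scale` is a hypothetical helper and the input codes are made up.

```r
# Hypothetical helper: map integer codes 1..k to labels, keeping the scale order
recode_scale <- function(x, labels) {
  factor(x, levels = seq_along(labels), labels = labels)
}

education_labels    <- c("below College", "College", "Bachelor", "Master", "Doctor")
satisfaction_labels <- c("Low", "Medium", "High", "Very High")

# Toy codes (illustrative, not taken from the dataset)
education <- recode_scale(c(2, 1, 4, 3, 5), education_labels)
job_sat   <- recode_scale(c(4, 2, 3, 1), satisfaction_labels)

print(levels(education))
print(table(job_sat))
```

Passing `ordered = TRUE` to `factor()` would additionally mark the scale as ordinal, at the cost of R's default polynomial contrasts in model formulas.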

# Count the character and numeric variables
Dataset %>% diagnose() %>% dplyr::select(types) %>% table
## .
## character   numeric 
##        15        16

3-4. Check for missing values.

There are several ways to do this; I prefer a graph, which makes the result visible at a glance.
No missing values were found.

# Missing value Check
Dataset %>% naniar::gg_miss_var()

# Save the dataset after this first round of transformation.

saveRDS(Dataset, "Dataset.RDS")
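
For a numeric check to go with the plot, missing values can be counted per column in base R. A minimal sketch on a made-up data frame (`toy` is an assumption for illustration):

```r
# Toy data frame with two missing cells (hypothetical values)
toy <- data.frame(
  Age        = c(41, NA, 37),
  Department = c("Sales", "Sales", NA),
  OverTime   = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
missing_count <- colSums(is.na(toy))
missing_pct   <- round(100 * colMeans(is.na(toy)), 1)

print(data.frame(missing_count, missing_pct))
```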

4. Exploratory Data Analysis(1)

4-1. Diagnose the data once more.

There are many EDA packages; I mainly use dlookr and ExPanDaR.
First, look at the categorical variables, including the variable of interest, Attrition.
The shares of Yes and No in Attrition differ widely (about 16% versus 84%),
so the imbalanced-dataset problem, which strongly affects classification, has to be addressed.
- After preprocessing, I will oversample with the widely used SMOTE (Synthetic Minority Over-sampling Technique).


# Check the categorical variables
Dataset %>% diagnose_category() 
## # A tibble: 57 x 6
##    variables      levels                     N  freq ratio  rank
##    <chr>          <chr>                  <int> <int> <dbl> <int>
##  1 Attrition      No                      1470  1233 83.9      1
##  2 Attrition      Yes                     1470   237 16.1      2
##  3 BusinessTravel Travel_Rarely           1470  1043 71.0      1
##  4 BusinessTravel Travel_Frequently       1470   277 18.8      2
##  5 BusinessTravel Non-Travel              1470   150 10.2      3
##  6 Department     Research & Development  1470   961 65.4      1
##  7 Department     Sales                   1470   446 30.3      2
##  8 Department     Human Resources         1470    63  4.29     3
##  9 Education      Bachelor                1470   572 38.9      1
## 10 Education      Master                  1470   398 27.1      2
## # … with 47 more rows

# Check the numeric variables
Dataset %>% diagnose_numeric() 
## # A tibble: 16 x 10
##    variables          min    Q1     mean median     Q3   max  zero minus outlier
##    <chr>            <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl> <int> <int>   <int>
##  1 Age                 18    30  3.69e+1    36  4.3 e1    60     0     0       0
##  2 DailyRate          102   465  8.02e+2   802  1.16e3  1499     0     0       0
##  3 DistanceFromHome     1     2  9.19e+0     7  1.4 e1    29     0     0       0
##  4 HourlyRate          30    48  6.59e+1    66  8.38e1   100     0     0       0
##  5 JobLevel             1     1  2.06e+0     2  3   e0     5     0     0       0
##  6 MonthlyIncome     1009  2911  6.50e+3  4919  8.38e3 19999     0     0     114
##  7 MonthlyRate       2094  8047  1.43e+4 14236. 2.05e4 26999     0     0       0
##  8 NumCompaniesWor…     0     1  2.69e+0     2  4   e0     9   197     0      52
##  9 PercentSalaryHi…    11    12  1.52e+1    14  1.8 e1    25     0     0       0
## 10 StockOptionLevel     0     0  7.94e-1     1  1   e0     3   631     0      85
## 11 TotalWorkingYea…     0     6  1.13e+1    10  1.5 e1    40    11     0      63
## 12 TrainingTimesLa…     0     2  2.80e+0     3  3   e0     6    54     0     238
## 13 YearsAtCompany       0     3  7.01e+0     5  9   e0    40    44     0     104
## 14 YearsInCurrentR…     0     2  4.23e+0     3  7   e0    18   244     0      21
## 15 YearsSinceLastP…     0     0  2.19e+0     1  3   e0    15   581     0     107
## 16 YearsWithCurrMa…     0     2  4.12e+0     3  7   e0    17   263     0      14

4-2. Check univariate outliers.

Criteria for diagnosing and removing outliers differ among analysts. Some skip outlier handling altogether when using machine learning,
on the grounds that such models are not strongly affected by outliers or noisy data⁸,
but I usually check univariate and multivariate outliers before analysis.
Univariate outliers can be checked in several ways.


# List the variables in descending order of outlier count
Dataset %>% diagnose_outlier() %>% arrange(desc(outliers_cnt))
## # A tibble: 16 x 6
##    variables    outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##    <chr>               <int>          <dbl>         <dbl>     <dbl>        <dbl>
##  1 TrainingTim…          238         16.2            4.14     2.80         2.54 
##  2 MonthlyInco…          114          7.76       18400.    6503.        5503.   
##  3 YearsSinceL…          107          7.28          11.1      2.19         1.48 
##  4 YearsAtComp…          104          7.07          23.5      7.01         5.75 
##  5 StockOption…           85          5.78           3        0.794        0.658
##  6 TotalWorkin…           63          4.29          32.6     11.3         10.3  
##  7 NumCompanie…           52          3.54           9        2.69         2.46 
##  8 YearsInCurr…           21          1.43          16        4.23         4.06 
##  9 YearsWithCu…           14          0.952         16.1      4.12         4.01 
## 10 Age                     0          0            NaN       36.9         36.9  
## 11 DailyRate               0          0            NaN      802.         802.   
## 12 DistanceFro…            0          0            NaN        9.19         9.19 
## 13 HourlyRate              0          0            NaN       65.9         65.9  
## 14 JobLevel                0          0            NaN        2.06         2.06 
## 15 MonthlyRate             0          0            NaN    14313.       14313.   
## 16 PercentSala…            0          0            NaN       15.2         15.2

# Check the variables whose outlier ratio exceeds 5%
Dataset %>% diagnose_outlier() %>%  filter(outliers_ratio > 5) %>% 
  mutate(rate = outliers_mean / with_mean) %>% 
  arrange(desc(rate)) %>% dplyr::select(-outliers_cnt)
## # A tibble: 5 x 6
##   variables            outliers_ratio outliers_mean with_mean without_mean  rate
##   <chr>                         <dbl>         <dbl>     <dbl>        <dbl> <dbl>
## 1 YearsSinceLastPromo…           7.28         11.1      2.19         1.48   5.09
## 2 StockOptionLevel               5.78          3        0.794        0.658  3.78
## 3 YearsAtCompany                 7.07         23.5      7.01         5.75   3.36
## 4 MonthlyIncome                  7.76      18400.    6503.        5503.     2.83
## 5 TrainingTimesLastYe…          16.2           4.14     2.80         2.54   1.48
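
The columns in these tables can be reproduced by hand for a single variable. The sketch below assumes outliers are flagged with the boxplot rule (values more than 1.5 times the hinge spread beyond the hinges, as returned by `boxplot.stats()`), which is my understanding of what dlookr does. The `income` vector is made up.

```r
# Toy vector (hypothetical monthly incomes)
income <- c(2000, 2500, 2700, 3000, 3100, 3300, 3500, 4000, 19000)

# Values flagged by the boxplot rule
out_values <- boxplot.stats(income)$out
is_out <- income %in% out_values

summary_tbl <- data.frame(
  outliers_cnt   = sum(is_out),
  outliers_ratio = 100 * mean(is_out),        # percent of observations
  outliers_mean  = mean(income[is_out]),
  with_mean      = mean(income),
  without_mean   = mean(income[!is_out])
)

# Ratio of the outlier mean to the overall mean, as in the table above
summary_tbl$rate <- summary_tbl$outliers_mean / summary_tbl$with_mean
print(summary_tbl)
```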

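4-3. Decide whether to remove or keep the univariate outliers.

For YearsSinceLastPromotion, StockOptionLevel, YearsAtCompany, MonthlyIncome,
and TrainingTimesLastYear, the mean of the outliers is clearly above the overall mean.
When the ratio of the outlier mean to the overall mean (rate) is large, replacing or removing the outliers is usually advisable.
In a real workplace, however, tenure, stock option level, years since promotion, monthly pay, and training counts
can plausibly take extreme values, and those extremes may themselves influence attrition.
One option is to review the descriptive statistics of the variables that contain outliers before deciding whether to remove them: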

Dataset %>% dplyr::select(find_outliers(.)) %>% describe()
## # A tibble: 9 x 26
##   variable         n    na    mean      sd se_mean   IQR skewness kurtosis   p00
##   <chr>        <int> <int>   <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl>
## 1 MonthlyInco…  1470     0 6.50e+3 4.71e+3 1.23e+2  5468    1.37    1.01    1009
## 2 NumCompanie…  1470     0 2.69e+0 2.50e+0 6.52e-2     3    1.03    0.0102     0
## 3 StockOption…  1470     0 7.94e-1 8.52e-1 2.22e-2     1    0.969   0.365      0
## 4 TotalWorkin…  1470     0 1.13e+1 7.78e+0 2.03e-1     9    1.12    0.918      0
## 5 TrainingTim…  1470     0 2.80e+0 1.29e+0 3.36e-2     1    0.553   0.495      0
## 6 YearsAtComp…  1470     0 7.01e+0 6.13e+0 1.60e-1     6    1.76    3.94       0
## 7 YearsInCurr…  1470     0 4.23e+0 3.62e+0 9.45e-2     5    0.917   0.477      0
## 8 YearsSinceL…  1470     0 2.19e+0 3.22e+0 8.40e-2     3    1.98    3.61       0
## 9 YearsWithCu…  1470     0 4.12e+0 3.57e+0 9.31e-2     5    0.833   0.171      0
## # … with 16 more variables: p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## #   p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## #   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>

You can also compare the distributions with and without the outliers graphically.


Dataset %>%
  plot_outlier(diagnose_outlier(Dataset) %>%
                 filter(outliers_ratio >= 0.5) %>%
                 dplyr::select(variables) %>%
                 unlist())


# With dlookr, the single line below produces the whole diagnosis as a report.

# Dataset %>% diagnose_web_report()

MonthlyIncome is removed, since other indicators of pay level,
such as MonthlyRate, remain available.

Dataset %>% dplyr::select(MonthlyIncome) %>% plot_box_numeric()

Dataset %>% dplyr::select(-MonthlyIncome) -> Dataset

5. Exploratory Data Analysis(2)

5-1. Check multivariate outliers.

Multivariate outliers are checked alongside the univariate ones.
For multivariate outliers, I used the classical Mahalanobis distance.

# Extract only the numeric variables and compute the multivariate outliers.
# The quantile for the cut-off value is set to .99.
Dataset %>% purrr::keep(is.numeric) -> outcheck_num
outcheck_num %>% chemometrics::Moutlier(quantile=.99)-> Mout

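5-2. Check diversity to decide whether to remove them.

The robust Mahalanobis distance is conservative, so I use the classical Mahalanobis distance to screen
the data and then check the diversity (here, the Attrition mix) of the multivariate outliers.

# Attach the distances to the original dataset
Dataset %>% mutate(md=Mout$md)->Dataset

# Count the observations above the cutoff value (6.015885)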
Dataset %>% filter(md>Mout$cutoff) %>% nrow
## [1] 68
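
The screening step can be sketched in base R. This assumes the distance is the square root of `mahalanobis()` and the cutoff is the square root of the chi-square quantile with one degree of freedom per column, which is how I understand `Moutlier()` to define them. The data is simulated and the planted outlier is an illustration only.

```r
set.seed(1)
# Toy numeric data: 200 rows, 3 columns (simulated, not the IBM dataset)
X <- matrix(rnorm(600), ncol = 3)
X[1, ] <- c(8, -8, 8)   # plant one obvious multivariate outlier

# Classical Mahalanobis distance of each row from the column means
md <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))

# Cutoff: square root of the 0.99 chi-square quantile, df = number of columns
cutoff <- sqrt(qchisq(0.99, df = ncol(X)))

flagged <- which(md > cutoff)
print(cutoff)
print(flagged)
```

Under the same assumption, `sqrt(qchisq(0.99, df = 19))` coincides numerically with the 6.015885 quoted in the comment above.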

Check the diversity of the 68 observations above the cutoff value, Mout$cutoff.

# Original Dataset Attrition ratio
Dataset %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No      1470  1233  83.9     1
## 2 Attrition Yes     1470   237  16.1     2

# Multivariate Outlier Dataset Attrition ratio
Dataset %>% filter(md>Mout$cutoff) %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No        68    59  86.8     1
## 2 Attrition Yes       68     9  13.2     2

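The Attrition mix among the multivariate outliers (86.8% No, 13.2% Yes) is close to that of the full data (83.9% No, 16.1% Yes), so they are removed.


# 68 multivariate outlier observations were found; remove them and then drop the md column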

Dataset %>% filter(md<Mout$cutoff) -> Dataset
Dataset %>% dplyr::select(-md) -> Dataset

# Save the cleaned dataset again
saveRDS(Dataset,"Dataset_pre.RDS")

# Run exploratory analysis on the cleaned dataset with the ExPanDaR package.
# It shows correlations, scatterplots, and more, and runs as a web app.
# Dataset %>% ExPanD()

5-3. Examine relationships between variables from multiple angles.

In actual exploratory work, I use packages such as ExPanDaR to examine the relationships between variables from various perspectives.
This completes the preprocessing and EDA of the data.

Modeling1. Logistic Regression

1. Modeling

1-1. Logistic Regression?

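Logistic regression predicts the probability of an event by assuming a linear relationship between the independent variables
and the log-odds of a categorical dependent variable. It can also be run in GUI-based statistics programs such as SPSS or jamovi.
It tends to underfit and is sensitive to outliers and noise, so I do not use it often.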

1-2. Split the Data

To build and test the model, the data was split into a training set and a test set at a 7:3 ratio.

# Data Import

Dataset <- readRDS("Dataset_pre.RDS")

# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_glm

# Setting Reference level

Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels

## [1] "Yes" "No"

set.seed(2727)
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)

glm_train %>% nrow

## [1] 982

glm_test %>% nrow

## [1] 420
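
A stratified split samples within each outcome class, so the class shares stay similar in both sets. The base R sketch below illustrates that idea only; it is not the algorithm `initial_split()` uses internally, and `toy` is a made-up data frame.

```r
set.seed(2727)
# Toy outcome with roughly the same imbalance as the data (hypothetical)
toy <- data.frame(
  id = 1:100,
  Attrition = factor(rep(c("Yes", "No"), times = c(16, 84)), levels = c("Yes", "No"))
)

# Sample 70% of the row indices within each Attrition class
rows_by_class <- split(seq_len(nrow(toy)), toy$Attrition)
train_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
})))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(prop.table(table(toy_train$Attrition)))
print(prop.table(table(toy_test$Attrition)))
```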

1-3. Preprocess once more with a recipe.

The recipe performs normalization, dummy coding, and a multicollinearity (correlation) check.
The printed recipe shows that no variable was removed for multicollinearity
and that dummy coding and normalization were applied.

# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe

glm_recipe

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 982 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]

# extract the preprocessed training set
glm_recipe %>% juice -> glm_train_re

# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re
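
What the three recipe steps do can be approximated in base R. This is a simplified sketch on made-up data: the scaling parameters are learned from the training rows only and reused for the test rows, and the correlation filter simply flags the later column of any pair with an absolute correlation above 0.9 (the default threshold of `step_corr()`), which is cruder than the selection rule in recipes.

```r
# Toy train/test data (hypothetical values, not the IBM dataset)
train <- data.frame(
  Age      = c(25, 35, 45, 55),
  Tenure   = c(1, 5, 9, 20),
  OverTime = factor(c("No", "Yes", "No", "Yes"))
)
test <- data.frame(
  Age      = c(30, 50),
  Tenure   = c(2, 10),
  OverTime = factor(c("Yes", "No"), levels = levels(train$OverTime))
)
num_cols <- c("Age", "Tenure")

# "prep": learn centering and scaling parameters from the training set only
centers <- vapply(train[num_cols], mean, numeric(1))
scales  <- vapply(train[num_cols], sd, numeric(1))

# "bake": apply the training parameters, then dummy-code the factors
preprocess <- function(df) {
  df[num_cols] <- Map(function(x, m, s) (x - m) / s, df[num_cols], centers, scales)
  as.data.frame(model.matrix(~ ., data = df)[, -1, drop = FALSE])
}
train_re <- preprocess(train)
test_re  <- preprocess(test)

# Simplified correlation filter: flag the later column of any pair with |r| > 0.9
cor_mat <- abs(cor(train_re))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0
high_corr <- names(which(apply(cor_mat, 1, max) > 0.9))

print(round(train_re, 2))
print(high_corr)
```

Reusing the training means and standard deviations on the test rows is the point of separating the "prep" and "bake" stages: nothing about the test set leaks into the preprocessing.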

1-4. Set up the model and fit it on the training data.

# Model Setting
glm_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')

glm_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

# Fitting Logistic Regression

glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)

# Look at the factors that affect Attrition.
tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)
## # A tibble: 18 x 5
##    term                             estimate std.error statistic  p.value
##    <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
##  1 Age                                 1.42      0.155      2.25 2.47e- 2
##  2 DailyRate                           1.26      0.116      1.96 4.97e- 2
##  3 DistanceFromHome                    0.688     0.111     -3.36 7.69e- 4
##  4 NumCompaniesWorked                  0.642     0.131     -3.37 7.57e- 4
##  5 YearsInCurrentRole                  1.72      0.234      2.32 2.01e- 2
##  6 YearsSinceLastPromotion             0.535     0.168     -3.72 2.02e- 4
##  7 BusinessTravel_Travel_Frequently    0.134     0.535     -3.76 1.67e- 4
##  8 BusinessTravel_Travel_Rarely        0.369     0.489     -2.04 4.14e- 2
##  9 EnvironmentSatisfaction_Low         0.285     0.325     -3.86 1.11e- 4
## 10 Gender_Male                         0.541     0.249     -2.47 1.34e- 2
## 11 JobInvolvement_Low                  0.203     0.417     -3.82 1.33e- 4
## 12 JobRole_Laboratory.Technician       0.253     0.618     -2.22 2.62e- 2
## 13 JobSatisfaction_Very.High           2.25      0.311      2.60 9.24e- 3
## 14 MaritalStatus_Single                0.310     0.441     -2.66 7.85e- 3
## 15 OverTime_Yes                        0.111     0.264     -8.33 7.77e-17
## 16 RelationshipSatisfaction_Low        0.438     0.325     -2.53 1.13e- 2
## 17 WorkLifeBalance_Better              5.11      0.440      3.71 2.07e- 4
## 18 WorkLifeBalance_Good                2.59      0.463      2.06 3.98e- 2
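
When reading these exponentiated estimates, the direction depends on the level order of the outcome. With a factor response, `glm()` treats the first level as failure and models the probability of the remaining level. Since "Yes" was set as the first level, the estimates should then be odds ratios for "No" (staying), assuming the factor is passed to `glm()` unchanged. Under that reading, an estimate below 1 such as OverTime_Yes (0.111) means lower odds of staying. The sketch below checks the mechanism on made-up counts.

```r
# Toy data (hypothetical counts): 350 without overtime (42 leave), 150 with overtime (57 leave)
overtime  <- rep(c(0, 1), times = c(350, 150))
attrition <- factor(c(rep(c("Yes", "No"), times = c(42, 308)),
                      rep(c("Yes", "No"), times = c(57, 93))))

# Make "Yes" the first (reference) level, as in the text
attrition <- relevel(attrition, ref = "Yes")

fit <- glm(attrition ~ overtime, family = binomial)

# glm() models the second level ("No"), so this is the odds ratio of staying
odds_ratio_stay  <- unname(exp(coef(fit))["overtime"])
# Inverting it gives the odds ratio of leaving
odds_ratio_leave <- 1 / odds_ratio_stay

print(c(stay = odds_ratio_stay, leave = odds_ratio_leave))
```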

1-5. Evaluate predictive performance on the test set.

A model built on imbalanced data is biased toward the majority class,
so performance is judged by F1 score or AUC rather than accuracy.

# Model Prediction
pre_class <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class %>% head
## # A tibble: 6 x 1
##   .pred_class
##   <fct>      
## 1 Yes        
## 2 No         
## 3 No         
## 4 Yes        
## 5 No         
## 6 No
pre_prob <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob %>% head
## # A tibble: 6 x 2
##   .pred_Yes .pred_No
##       <dbl>    <dbl>
## 1   0.674      0.326
## 2   0.333      0.667
## 3   0.0371     0.963
## 4   0.800      0.200
## 5   0.00743    0.993
## 6   0.0659     0.934
evaluation_tbl <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class) %>% 
  bind_cols(pre_prob)
evaluation_tbl
## # A tibble: 420 x 4
##    Attrition .pred_class     .pred_Yes .pred_No
##    <fct>     <fct>               <dbl>    <dbl>
##  1 Yes       Yes         0.674            0.326
##  2 Yes       No          0.333            0.667
##  3 No        No          0.0371           0.963
##  4 Yes       Yes         0.800            0.200
##  5 No        No          0.00743          0.993
##  6 No        No          0.0659           0.934
##  7 No        No          0.000470         1.00 
##  8 Yes       No          0.214            0.786
##  9 No        No          0.00000000245    1.00 
## 10 Yes       No          0.100            0.900
## # … with 410 more rows

The results are reviewed mainly through accuracy, specificity, recall, precision, and roc_auc.

#Evaluation
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class)
##           Truth
## Prediction Yes  No
##        Yes  33  14
##        No   35 338
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.883
##  2 kap                  binary         0.509
##  3 sens                 binary         0.485
##  4 spec                 binary         0.960
##  5 ppv                  binary         0.702
##  6 npv                  binary         0.906
##  7 mcc                  binary         0.521
##  8 j_index              binary         0.446
##  9 bal_accuracy         binary         0.723
## 10 detection_prevalence binary         0.112
## 11 precision            binary         0.702
## 12 recall               binary         0.485
## 13 f_meas               binary         0.574
roc_auc(evaluation_tbl, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.887
evaluation_tbl %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
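
The main metrics in the summary follow directly from the four cells of the confusion matrix. A short worked check, using the counts printed above with "Yes" as the positive class:

```r
# Counts copied from the confusion matrix above ("Yes" is the positive class)
tp <- 33    # predicted Yes, truth Yes
fp <- 14    # predicted Yes, truth No
fn <- 35    # predicted No,  truth Yes
tn <- 338   # predicted No,  truth No

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
recall      <- tp / (tp + fn)          # sensitivity
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)          # positive predictive value
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, recall = recall, specificity = specificity,
              precision = precision, f1 = f1), 3))
```

Accuracy is high mostly because the majority class is predicted well; recall shows that fewer than half of the actual leavers are caught.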

2. Modeling with SMOTE

2-1. Oversample the imbalanced data.

In classification, imbalanced data biases the estimated classifier toward the majority class,
which lowers classification accuracy for the minority class.⁹
Various sampling methods have been proposed to address this, including SMOTE and ADASYN, and more recently deep-learning-based sampling methods.
I used SMOTE, the most widely used of these.
The preprocessed data is first split into a training set and a test set, stratified on Attrition,
and SMOTE is then applied to the training set only.

glm_train %>% as.data.frame %>% SMOTE_NC('Attrition')->glm_train_SMOTE

Comparing the SMOTE-sampled result with the original training data,
the Attrition ratio changes from about 84:16 to 50:50, and the number of observations grows from 982 to 1,644.

glm_train %>% dplyr::select(Attrition)  %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No       982   822  83.7     1
## 2 Attrition Yes      982   160  16.3     2
glm_train_SMOTE %>% dplyr::select(Attrition)  %>% diagnose_category()
## # A tibble: 2 x 6
##   variables levels     N  freq ratio  rank
##   <chr>     <chr>  <int> <int> <dbl> <int>
## 1 Attrition No      1644   822    50     1
## 2 Attrition Yes     1644   822    50     1
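
In the output above, the minority class grows from 160 to 822 rows, so 662 rows are synthetic. The sketch below illustrates the basic SMOTE idea for numeric features only: pick a minority row, pick one of its k nearest minority neighbours, and place a new point at a random position on the line segment between them. smote_numeric() is a hypothetical helper written for illustration; it is not the SMOTE_NC implementation used above, which also handles nominal features.

```r
# Minimal SMOTE-style sketch for numeric features only (hypothetical helper,
# not the SMOTE_NC implementation used in this post).
smote_numeric <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  n <- nrow(x_min)
  k <- min(k, n - 1)
  d <- as.matrix(dist(x_min))   # pairwise Euclidean distances between minority rows
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                 # random minority row
    nn   <- order(d[base, ])[2:(k + 1)]      # its k nearest neighbours (position 1 is the row itself)
    nb   <- nn[sample.int(length(nn), 1)]    # pick one neighbour at random
    gap  <- runif(1)                         # position along the segment
    synth[i, ] <- x_min[base, ] + gap * (x_min[nb, ] - x_min[base, ])
  }
  synth
}

set.seed(2727)
# Toy minority class with two made-up numeric features
minority <- cbind(age = rnorm(20, 30, 5), tenure = rnorm(20, 3, 1))
new_rows <- smote_numeric(minority, n_new = 40)
dim(new_rows)
```

Because every synthetic point is an interpolation between two existing minority rows, the new rows stay inside the region the minority class already occupies. That is one reason oversampling can overfit when the minority class is small.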

2-2. Set up the model and train it on the train data.

I refit the model on the oversampled train data set.
This time the recipe output shows that a variable was removed because of multicollinearity.



# pre-processing by recipe
glm_train_SMOTE %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe

glm_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 1644 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]

# make the train data set
glm_recipe %>% juice -> glm_train_re

# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re


# Model Setting
glm_model <- logistic_reg() %>% 
  set_engine('glm') %>% 
  set_mode('classification')

glm_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

I retrain the model on the SMOTE-sampled train data.

# Fitting Logistic Regression

glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)

tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)

## # A tibble: 32 x 5
##    term                    estimate std.error statistic      p.value
##    <chr>                      <dbl>     <dbl>     <dbl>        <dbl>
##  1 (Intercept)              190.       0.974       5.39 0.0000000706
##  2 Age                        1.31     0.0936      2.88 0.00403     
##  3 DailyRate                  1.42     0.0786      4.48 0.00000758  
##  4 DistanceFromHome           0.677    0.0795     -4.91 0.000000930 
##  5 NumCompaniesWorked         0.687    0.0852     -4.40 0.0000106   
##  6 PercentSalaryHike          0.816    0.0957     -2.12 0.0336      
##  7 TotalWorkingYears          1.42     0.136       2.57 0.0102      
##  8 TrainingTimesLastYear      1.18     0.0759      2.21 0.0273      
##  9 YearsInCurrentRole         1.56     0.137       3.24 0.00121     
## 10 YearsSinceLastPromotion    0.576    0.103      -5.37 0.0000000800
## # … with 22 more rows

# Model Prediction
pre_class2 <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class2 %>% head

## # A tibble: 6 x 1
##   .pred_class
##   <fct>      
## 1 Yes        
## 2 Yes        
## 3 No         
## 4 Yes        
## 5 No         
## 6 No

pre_prob2 <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob2 %>% head

## # A tibble: 6 x 2
##   .pred_Yes .pred_No
##       <dbl>    <dbl>
## 1   0.581     0.419 
## 2   0.645     0.355 
## 3   0.102     0.898 
## 4   0.985     0.0154
## 5   0.00205   0.998 
## 6   0.00592   0.994

evaluation_tbl2 <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class2) %>% 
  bind_cols(pre_prob2)
evaluation_tbl2

## # A tibble: 420 x 4
##    Attrition .pred_class    .pred_Yes .pred_No
##    <fct>     <fct>              <dbl>    <dbl>
##  1 Yes       Yes         0.581          0.419 
##  2 Yes       Yes         0.645          0.355 
##  3 No        No          0.102          0.898 
##  4 Yes       Yes         0.985          0.0154
##  5 No        No          0.00205        0.998 
##  6 No        No          0.00592        0.994 
##  7 No        No          0.00132        0.999 
##  8 Yes       No          0.372          0.628 
##  9 No        No          0.0000000108   1.00  
## 10 Yes       No          0.499          0.501 
## # … with 410 more rows

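2-3. Performance actually gets worse.

Compared with the model without oversampling, accuracy (0.793 vs. 0.883), F1 (0.514 vs. 0.574), and AUC (0.830 vs. 0.887) all drop, although recall rises (0.676 vs. 0.485).
Oversampled data can give worse performance because of overfitting,
so from here on I run logistic regression without oversampling
and evaluate models by F1 score and AUC rather than accuracy.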

# Evaluation
conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class)

##           Truth
## Prediction Yes  No
##        Yes  46  65
##        No   22 287

conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class) %>% summary

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.793
##  2 kap                  binary         0.392
##  3 sens                 binary         0.676
##  4 spec                 binary         0.815
##  5 ppv                  binary         0.414
##  6 npv                  binary         0.929
##  7 mcc                  binary         0.411
##  8 j_index              binary         0.492
##  9 bal_accuracy         binary         0.746
## 10 detection_prevalence binary         0.264
## 11 precision            binary         0.414
## 12 recall               binary         0.676
## 13 f_meas               binary         0.514

roc_auc(evaluation_tbl2, truth = Attrition, .pred_Yes)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.830

evaluation_tbl2 %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()

3. Modeling with backward selection

3-1. Select predictors with stepwise logistic regression to improve model performance.

I start again from the data that has not been oversampled.

# Data Import

Dataset <- readRDS("Dataset_pre.RDS")

# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_glm

# Setting Reference level

Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels

## [1] "Yes" "No"

set.seed(2727)
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)

glm_train %>% nrow

## [1] 982

glm_test %>% nrow

## [1] 420

# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> glm_recipe

glm_recipe

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 982 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]

# make the train data set
glm_recipe %>% juice -> glm_train_re

# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re

I use backward selection, which starts from the full model and removes predictors.

# Model Improvement

glm(Attrition~., family = 'binomial', data=glm_train_re) %>% 
  MASS::stepAIC(direction = "backward") -> step_glm

glm_fit_mod <- glm_model %>% fit(step_glm$formula, data=glm_train_re)


# Improved Model Prediction

pre_class_re <-  glm_fit_mod %>% predict(new_data=glm_test_re, type="class")
pre_class_re %>% head
pre_prob_re <-  glm_fit_mod %>% predict(new_data=glm_test_re, type="prob")
pre_prob_re %>% head
evaluation_tbl_mod <- glm_test_re %>% 
  dplyr::select(Attrition) %>% bind_cols(pre_class_re) %>% 
  bind_cols(pre_prob_re)

evaluation_tbl_mod
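
Backward selection by AIC can be tried on a small simulated data set to see what it does. The sketch below is only an illustration under my own assumptions: the data are made up (only x1 truly drives the outcome), and stats::step() is used as a stand-in for MASS::stepAIC(direction = "backward") so that the example needs base R only.

```r
# Simulated data: only x1 truly drives the outcome; x2 and x3 are noise.
set.seed(2727)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(1.2 * df$x1))

# Full logistic regression with every predictor
full_fit <- glm(y ~ ., family = binomial, data = df)

# Backward elimination: drop a term only while doing so lowers the AIC
sel_fit <- step(full_fit, direction = "backward", trace = 0)

formula(sel_fit)
c(full = AIC(full_fit), selected = AIC(sel_fit))
```

The selected model never has a higher AIC than the full model, because a term is removed only when removal lowers the AIC. A lower AIC on the train data does not guarantee a better AUC on the test data, so the selected model still has to be evaluated separately.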

3-2. By AUC, the selected model (0.880) improves on the SMOTE model (0.830) but stays slightly below the first full model (0.887).

# Improved Model Evaluation

conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class)

##           Truth
## Prediction Yes  No
##        Yes  31  14
##        No   37 338

conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class) %>% summary

## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.879
##  2 kap                  binary         0.482
##  3 sens                 binary         0.456
##  4 spec                 binary         0.960
##  5 ppv                  binary         0.689
##  6 npv                  binary         0.901
##  7 mcc                  binary         0.496
##  8 j_index              binary         0.416
##  9 bal_accuracy         binary         0.708
## 10 detection_prevalence binary         0.107
## 11 precision            binary         0.689
## 12 recall               binary         0.456
## 13 f_meas               binary         0.549

roc_auc(evaluation_tbl_mod, truth = Attrition, .pred_Yes)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.880

evaluation_tbl_mod %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()

4. variable importance

4-1. Compute variable importance.

There are several measures of variable importance for logistic regression;
in this study I used the 'vi()' function from the vip package.
I will share more detail on the variables that affect attrition in the presentation.

vip::vi(glm_fit_mod)
## # A tibble: 26 x 3
##    Variable                         Importance Sign 
##    <chr>                                 <dbl> <chr>
##  1 OverTime_Yes                           8.47 NEG  
##  2 EnvironmentSatisfaction_Low            5.00 NEG  
##  3 MaritalStatus_Single                   4.62 NEG  
##  4 BusinessTravel_Travel_Frequently       3.86 NEG  
##  5 YearsSinceLastPromotion                3.77 NEG  
##  6 NumCompaniesWorked                     3.77 NEG  
##  7 JobInvolvement_Low                     3.71 NEG  
##  8 EducationField_Life.Sciences           3.65 POS  
##  9 WorkLifeBalance_Better                 3.61 POS  
## 10 EducationField_Medical                 3.44 POS  
## # … with 16 more rows
vip::vip(glm_fit_mod)
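
As far as I understand, the model-based importance that vip reports for a glm corresponds to the absolute value of each coefficient's test statistic, with Sign taken from the coefficient. The base R sketch below reproduces that idea on made-up data; the data frame and its column names (overtime, age, left) are hypothetical. Here the outcome is coded 0/1, so POS means a higher chance of leaving. In the model above, Attrition was releveled so that "Yes" is the first level, and glm() then models the probability of the second level ("No"), so a NEG sign there should be read as a lower chance of staying.

```r
# Made-up data: overtime raises and age lowers the chance of leaving.
set.seed(2727)
n  <- 300
df <- data.frame(overtime = rbinom(n, 1, 0.3), age = rnorm(n))
df$left <- rbinom(n, 1, plogis(-1 + 1.5 * df$overtime - 0.5 * df$age))

fit  <- glm(left ~ overtime + age, family = binomial, data = df)

# Coefficient table without the intercept
ctab <- summary(fit)$coefficients
ctab <- ctab[rownames(ctab) != "(Intercept)", , drop = FALSE]

# Importance = |z statistic|, Sign = sign of the coefficient
importance <- data.frame(
  Variable   = rownames(ctab),
  Importance = abs(ctab[, "z value"]),
  Sign       = ifelse(ctab[, "Estimate"] >= 0, "POS", "NEG"),
  row.names  = NULL
)
importance[order(-importance$Importance), ]
```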

Modeling2. RandomForest

1. preprocessing

1-1. RandomForest?

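Random forest is an ensemble machine learning model that combines many learners (decision trees)
to overcome the overfitting tendency of a single decision tree. It is a standard algorithm for classification,
but it has many hyperparameters, so tuning takes time.
Recently, AutoML, which automatically applies a range of machine learning techniques,
and algorithms such as LightGBM and XGBoost seem to be replacing random forest.

1-2. Prepare the data.

To keep training from being biased toward the majority class, I use the data oversampled with SMOTE.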


Dataset <- readRDS("Dataset_pre.RDS")

# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_rnd

# Setting Reference level

Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"

set.seed(2727)
split <- initial_split(Data_rnd, prop = .7, strata = Attrition)
rnd_train <- training(split)
rnd_test <- testing(split)

rnd_train %>% as.data.frame %>% SMOTE_NC('Attrition')->rnd_train_SMOTE

# pre-processing by recipe
rnd_train_SMOTE %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> rnd_recipe

rnd_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         29
## 
## Training data contained 1644 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]

# make the train data set
rnd_recipe %>% juice -> rnd_train_re

# bake the test data set
rnd_recipe %>% bake(rnd_test) -> rnd_test_re

1-3. Build the data set for validation.

# make validation set
set.seed(2727)
data_fold <- vfold_cv(rnd_train_re)
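
vfold_cv() uses 10 folds by default. The base R sketch below shows roughly what such a split amounts to: each row is assigned to one of 10 folds at random, and each fold in turn is held out for assessment while the other nine are used for fitting. The row count 1644 comes from the recipe output above; fold_id, assess_idx, and analysis_idx are names I made up, and this is not how rsample stores its splits internally.

```r
set.seed(2727)
n <- 1644   # rows in the oversampled training set
v <- 10     # number of folds

# Randomly assign each row to one of v folds of (nearly) equal size
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold 1: rows held out for assessment vs. rows used for fitting
assess_idx   <- which(fold_id == 1)
analysis_idx <- which(fold_id != 1)

table(fold_id)
```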

2. Modeling

2-1. Set the hyperparameters.

# hyperparameter tune: only mtry and min_n are tuned; my computer has 8 cores, so 6 threads are used for parallel processing
tune_spec <- rand_forest(mtry=tune(), trees = 1000, min_n = tune()) %>% 
  set_mode("classification") %>% set_engine('ranger', importance='impurity',seed=2727, num.threads=6)

tune_spec

## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger

2-2. Set up the workflow.

workflow() %>%
  add_model(tune_spec) %>% 
  add_formula(Attrition ~ .)-> workflow

workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## 
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger

2-3. Run a grid search for hyperparameter tuning.

rnd_model<- workflow %>% 
  tune_grid(data_fold, 
          grid=20, 
          control=control_grid(save_pred = TRUE), 
          metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
 

# Graph for hyperparameter tuning
rnd_model %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")

rnd_model %>%
  collect_metrics()

## # A tibble: 140 x 8
##     mtry min_n .metric   .estimator  mean     n std_err .config              
##    <int> <int> <chr>     <chr>      <dbl> <int>   <dbl> <chr>                
##  1    46    18 accuracy  binary     0.886    10 0.00767 Preprocessor1_Model01
##  2    46    18 f_meas    binary     0.880    10 0.00862 Preprocessor1_Model01
##  3    46    18 precision binary     0.918    10 0.0131  Preprocessor1_Model01
##  4    46    18 recall    binary     0.848    10 0.0160  Preprocessor1_Model01
##  5    46    18 roc_auc   binary     0.956    10 0.00392 Preprocessor1_Model01
##  6    46    18 sens      binary     0.848    10 0.0160  Preprocessor1_Model01
##  7    46    18 spec      binary     0.926    10 0.0108  Preprocessor1_Model01
##  8    20    29 accuracy  binary     0.894    10 0.00712 Preprocessor1_Model02
##  9    20    29 f_meas    binary     0.888    10 0.00873 Preprocessor1_Model02
## 10    20    29 precision binary     0.928    10 0.0115  Preprocessor1_Model02
## # … with 130 more rows

2-4. Set the best hyperparameters found by roc_auc.

# Logistic regression considered only recall, i.e. prediction accuracy for leavers, 
# but from random forest onward I use AUC, which reflects both recall and specificity (which also covers stayers)
rnd_model %>% select_best(metric = 'roc_auc')->param_best

3. Model Evaluation

3-1. Fit the model with the best grid selected by roc_auc.

  tune_spec %>% finalize_model(param_best)->rnd_best_model

3-2. Update the workflow and check the metrics.

# workflow update  
  workflow %>% finalize_workflow(param_best) -> workflow_final

  # Note: last_fit() refits on training(split), i.e. the raw training rows,
  # not the SMOTE-processed rnd_train_re used for tuning.
  workflow_final %>% last_fit(split, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))->rnd_best_fit2
  
  rnd_best_fit2 %>% collect_predictions() %>%   
    conf_mat(truth = Attrition, estimate=.pred_class)

##           Truth
## Prediction Yes  No
##        Yes  13  12
##        No   55 340

  rnd_best_fit2 %>% collect_predictions() %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()

3-3. Check variable importance.

deploy_randf <- fit(workflow_final, Data_rnd)

pull_workflow_fit(deploy_randf)$fit %>% vip::vi()

## # A tibble: 29 x 2
##    Variable           Importance
##    <chr>                   <dbl>
##  1 Age                      28.4
##  2 OverTime                 27.0
##  3 DailyRate                24.0
##  4 TotalWorkingYears        22.3
##  5 DistanceFromHome         22.1
##  6 HourlyRate               21.1
##  7 MonthlyRate              19.9
##  8 YearsAtCompany           14.1
##  9 NumCompaniesWorked       13.4
## 10 PercentSalaryHike        13.2
## # … with 19 more rows

# I want to rerun the random forest using only the variables with Importance of 10 or higher.

4. Model Improvement

4-1. Use only the variables selected earlier by stepwise selection.

4-2. Add the step_glm formula to the workflow.



Dataset <- readRDS("Dataset_pre.RDS")

# split the Data set and set the reference level
Dataset  %>% mutate_if(is.character, factor)->Data_rnd

# Setting Reference level

Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"

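# pre-processing by recipe
# Note: the recipe is prepped on Dataset, not the releveled Data_rnd,
# so "No" is the first level of Attrition in the outputs below.
Dataset %>% recipe(Attrition~.) %>% 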
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_predictors()) %>% prep()-> rnd_recipe_re

rnd_recipe_re %>% juice-> rnd_dataset

set.seed(2727)
split_re <- initial_split(rnd_dataset, prop = .7, strata = Attrition)
rnd_train <- training(split_re)
rnd_test <- testing(split_re)
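# Note: rnd_train_SMOTE is created here, but the selections below start from rnd_train,
# so the tuning results shown reflect the non-oversampled training data.
rnd_train %>% as.data.frame %>% SMOTE('Attrition')->rnd_train_SMOTE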

rnd_train %>% dplyr::select(Attrition, Age , DailyRate , DistanceFromHome , NumCompaniesWorked , 
    TotalWorkingYears , TrainingTimesLastYear , YearsInCurrentRole , 
    YearsSinceLastPromotion , BusinessTravel_Travel_Frequently , 
    BusinessTravel_Travel_Rarely , EducationField_Life.Sciences , 
    EducationField_Medical , EducationField_Other , EnvironmentSatisfaction_Low , 
    Gender_Male , JobInvolvement_Low , JobInvolvement_Very.High , 
    JobRole_Laboratory.Technician , JobRole_Research.Director , 
    JobRole_Sales.Representative , JobSatisfaction_Low , JobSatisfaction_Very.High , 
    MaritalStatus_Single , OverTime_Yes , RelationshipSatisfaction_Low , 
    WorkLifeBalance_Better)-> rnd_train_re_sel

rnd_test  %>% dplyr::select(Attrition, Age , DailyRate , DistanceFromHome , NumCompaniesWorked , 
    TotalWorkingYears , TrainingTimesLastYear , YearsInCurrentRole , 
    YearsSinceLastPromotion , BusinessTravel_Travel_Frequently , 
    BusinessTravel_Travel_Rarely , EducationField_Life.Sciences , 
    EducationField_Medical , EducationField_Other , EnvironmentSatisfaction_Low , 
    Gender_Male , JobInvolvement_Low , JobInvolvement_Very.High , 
    JobRole_Laboratory.Technician , JobRole_Research.Director , 
    JobRole_Sales.Representative , JobSatisfaction_Low , JobSatisfaction_Very.High , 
    MaritalStatus_Single , OverTime_Yes , RelationshipSatisfaction_Low , 
    WorkLifeBalance_Better)-> rnd_test_re_sel

# Validate data again
data_fold2 <- vfold_cv(rnd_train_re_sel)

# Workflow setting
workflow() %>%
  add_model(tune_spec) %>% 
  add_formula(step_glm$formula)-> workflow2

workflow2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ Age + DailyRate + DistanceFromHome + NumCompaniesWorked + 
##     TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + 
##     YearsSinceLastPromotion + BusinessTravel_Travel_Frequently + 
##     BusinessTravel_Travel_Rarely + EducationField_Life.Sciences + 
##     EducationField_Medical + EducationField_Other + EnvironmentSatisfaction_Low + 
##     Gender_Male + JobInvolvement_Low + JobInvolvement_Very.High + 
##     JobRole_Laboratory.Technician + JobRole_Research.Director + 
##     JobRole_Sales.Representative + JobSatisfaction_Low + JobSatisfaction_Very.High + 
##     MaritalStatus_Single + OverTime_Yes + RelationshipSatisfaction_Low + 
##     WorkLifeBalance_Better
## 
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
##   seed = 2727
##   num.threads = 6
## 
## Computational engine: ranger

# hyperparameter tune
rnd_model_mod<- workflow2 %>% 
  tune_grid(data_fold2, 
          grid=20, 
          control=control_grid(save_pred = TRUE), 
          metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
 

# Graph for hyperparameter tuning
rnd_model_mod %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")


rnd_model_mod %>%
  collect_metrics()
## # A tibble: 140 x 8
##     mtry min_n .metric   .estimator  mean     n std_err .config              
##    <int> <int> <chr>     <chr>      <dbl> <int>   <dbl> <chr>                
##  1     5    12 accuracy  binary     0.857    10 0.0115  Preprocessor1_Model01
##  2     5    12 f_meas    binary     0.921    10 0.00716 Preprocessor1_Model01
##  3     5    12 precision binary     0.857    10 0.0126  Preprocessor1_Model01
##  4     5    12 recall    binary     0.995    10 0.00195 Preprocessor1_Model01
##  5     5    12 roc_auc   binary     0.804    10 0.0183  Preprocessor1_Model01
##  6     5    12 sens      binary     0.995    10 0.00195 Preprocessor1_Model01
##  7     5    12 spec      binary     0.149    10 0.00974 Preprocessor1_Model01
##  8     5    37 accuracy  binary     0.852    10 0.0132  Preprocessor1_Model02
##  9     5    37 f_meas    binary     0.918    10 0.00799 Preprocessor1_Model02
## 10     5    37 precision binary     0.851    10 0.0136  Preprocessor1_Model02
## # … with 130 more rows

# Logistic regression considered only recall, i.e. prediction accuracy for leavers, 
# but from random forest onward I use AUC, which reflects both recall and specificity (which also covers stayers)
rnd_model_mod %>% select_best(metric = 'roc_auc')->param_best_mod

4-3. Update the workflow and evaluate the final model.

tune_spec %>% finalize_model(param_best_mod)->rnd_best_model


# workflow update  
  workflow2 %>% finalize_workflow(param_best_mod) -> workflow_final2

  workflow_final2 %>% last_fit(split_re, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))->rnd_best_fit3
  
  rnd_best_fit3 %>% collect_predictions() %>%   
    conf_mat(truth = Attrition, estimate=.pred_class)

##           Truth
## Prediction  No Yes
##        No  351  65
##        Yes   1   3

  rnd_best_fit3 %>% collect_metrics() %>% arrange(desc(.estimate))

## # A tibble: 7 x 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 sens      binary        0.997  Preprocessor1_Model1
## 2 recall    binary        0.997  Preprocessor1_Model1
## 3 f_meas    binary        0.914  Preprocessor1_Model1
## 4 precision binary        0.844  Preprocessor1_Model1
## 5 accuracy  binary        0.843  Preprocessor1_Model1
## 6 roc_auc   binary        0.834  Preprocessor1_Model1
## 7 spec      binary        0.0441 Preprocessor1_Model1

4-4. The model did not improve.

I deploy the first random forest model.

  # model deploy
  deploy_randf <- fit(workflow_final, Data_rnd)
  deploy_randf

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## 
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## 
## ─ Model ────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13L,      x), num.trees = ~1000, min.node.size = min_rows(~5L, x),      importance = ~"impurity", seed = ~2727, num.threads = ~6,      verbose = FALSE, probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      1402 
## Number of independent variables:  29 
## Mtry:                             13 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1090518

Modeling3. h2o.AutoML

1. preprocessing

1-1. AutoML?

AutoML is automated machine learning: it automates hyperparameter optimization
and the search for a suitable ML model.
Paid services such as Amazon SageMaker and Azure Machine Learning cover everything from
data preprocessing and feature engineering to comparing a range of ML models;
I used the open-source H2O for this analysis.

1-2. Preprocessing

I convert the data into the format H2O expects.
For H2O, I split the data into train, validation, and test sets.
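
Passing two ratios to a splitting function leaves the remainder as a third piece, so ratios of 0.7 and 0.15 give three sets of roughly 70%, 15%, and 15%. The base R sketch below mimics that with a random draw per row. It is only an approximation of the idea under my own assumptions and is not H2O's implementation; the row count 1402 is taken from the random forest output earlier, and part is a name I made up.

```r
set.seed(12)
n <- 1402   # rows in the full data set

# One uniform draw per row decides which piece the row falls into
u <- runif(n)
part <- cut(u, breaks = c(0, 0.70, 0.85, 1),
            labels = c("train", "validation", "test"),
            include.lowest = TRUE)

table(part)
```

Because the assignment is random per row, the three sizes are only approximately 70/15/15, and each row belongs to exactly one set.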

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 7 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.1.3 
##     H2O cluster version age:    3 months and 6 days  
##     H2O cluster name:           H2O_started_from_R_raymondkim_eou682 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.20 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.5 (2021-03-31)

Dataset <- readRDS("Dataset_pre.RDS")

Dataset  %>% mutate_if(is.character, factor)->Data_auto

# Setting Reference level
Data_auto$Attrition <- relevel(Data_auto$Attrition, ref = "Yes")

Data_auto %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe


h2o_recipe %>% juice -> Dataset_h2o

# Putting the original dataframe into an h2o format
Dataset_h2o %>% as.h2o(destination_frame = "h2o_df")->h2o_df

# Splitting into training, validation and testing sets
split_df <- h2o.splitFrame(h2o_df, c(0.7, 0.15), seed=12)

# Obtaining our three types of sets into three separate values
h2o_train <- h2o.assign(split_df[[1]], "train")
h2o_validation <- h2o.assign(split_df[[2]], "validation")
h2o_test <- h2o.assign(split_df[[3]], "test")


h2o.describe(h2o_train)
##                                 Label Type Missing Zeros PosInf NegInf
## 1                                 Age real       0     0      0      0
## 2                           DailyRate real       0     0      0      0
## 3                    DistanceFromHome real       0     0      0      0
## 4                          HourlyRate real       0     0      0      0
## 5                            JobLevel real       0     0      0      0
## 6                         MonthlyRate real       0     0      0      0
## 7                  NumCompaniesWorked real       0     0      0      0
## 8                   PercentSalaryHike real       0     0      0      0
## 9                    StockOptionLevel real       0     0      0      0
## 10                  TotalWorkingYears real       0     0      0      0
## 11              TrainingTimesLastYear real       0     0      0      0
## 12                     YearsAtCompany real       0     0      0      0
## 13                 YearsInCurrentRole real       0     0      0      0
## 14            YearsSinceLastPromotion real       0     0      0      0
## 15               YearsWithCurrManager real       0     0      0      0
## 16                          Attrition enum       0   838      0      0
## 17   BusinessTravel_Travel_Frequently  int       0   825      0      0
## 18       BusinessTravel_Travel_Rarely  int       0   283      0      0
## 19                   Department_Sales  int       0   681      0      0
## 20            Education_below.College  int       0   879      0      0
## 21                  Education_College  int       0   785      0      0
## 22                   Education_Doctor  int       0   958      0      0
## 23                   Education_Master  int       0   739      0      0
## 24       EducationField_Life.Sciences  int       0   577      0      0
## 25           EducationField_Marketing  int       0   889      0      0
## 26             EducationField_Medical  int       0   678      0      0
## 27               EducationField_Other  int       0   940      0      0
## 28    EducationField_Technical.Degree  int       0   904      0      0
## 29        EnvironmentSatisfaction_Low  int       0   801      0      0
## 30     EnvironmentSatisfaction_Medium  int       0   802      0      0
## 31  EnvironmentSatisfaction_Very.High  int       0   687      0      0
## 32                        Gender_Male  int       0   402      0      0
## 33                 JobInvolvement_Low  int       0   945      0      0
## 34              JobInvolvement_Medium  int       0   736      0      0
## 35           JobInvolvement_Very.High  int       0   893      0      0
## 36            JobRole_Human.Resources  int       0   960      0      0
## 37      JobRole_Laboratory.Technician  int       0   811      0      0
## 38                    JobRole_Manager  int       0   941      0      0
## 39     JobRole_Manufacturing.Director  int       0   885      0      0
## 40          JobRole_Research.Director  int       0   944      0      0
## 41         JobRole_Research.Scientist  int       0   795      0      0
## 42            JobRole_Sales.Executive  int       0   762      0      0
## 43       JobRole_Sales.Representative  int       0   930      0      0
## 44                JobSatisfaction_Low  int       0   809      0      0
## 45             JobSatisfaction_Medium  int       0   793      0      0
## 46          JobSatisfaction_Very.High  int       0   679      0      0
## 47              MaritalStatus_Married  int       0   539      0      0
## 48               MaritalStatus_Single  int       0   672      0      0
## 49                       OverTime_Yes  int       0   719      0      0
## 50      PerformanceRating_Outstanding  int       0   847      0      0
## 51       RelationshipSatisfaction_Low  int       0   799      0      0
## 52    RelationshipSatisfaction_Medium  int       0   783      0      0
## 53 RelationshipSatisfaction_Very.High  int       0   708      0      0
## 54               WorkLifeBalance_Best  int       0   886      0      0
## 55             WorkLifeBalance_Better  int       0   395      0      0
## 56               WorkLifeBalance_Good  int       0   765      0      0
##           Min      Max         Mean     Sigma Cardinality
## 1  -2.0784303 2.685350  0.015732536 1.0072595          NA
## 2  -1.7519591 1.722043 -0.006184071 0.9943607          NA
## 3  -1.0113326 2.484078  0.001687752 1.0064210          NA
## 4  -1.7707175 1.676989 -0.013879977 0.9941769          NA
## 5  -0.9502098 2.905634 -0.009546925 0.9905219          NA
## 6  -1.7264474 1.799486 -0.006659469 1.0059270          NA
## 7  -1.0754126 2.542171  0.007798073 1.0042799          NA
## 8  -1.1608358 2.710195 -0.012502504 1.0029059          NA
## 9  -0.9287806 2.612880 -0.029990700 0.9763015          NA
## 10 -1.5239202 3.707345 -0.004939040 1.0006931          NA
## 11 -2.1847628 2.498781 -0.010428912 0.9911376          NA
## 12 -1.2771588 3.788408 -0.009389695 0.9921338          NA
## 13 -1.1788794 3.826003 -0.012229704 0.9933579          NA
## 14 -0.6928449 4.421544 -0.044992894 0.9661122          NA
## 15 -1.1627758 3.583973 -0.005964821 0.9948860          NA
## 16  0.0000000 1.000000  0.156092649 0.3631260           2
## 17  0.0000000 1.000000  0.169184290 0.3751035          NA
## 18  0.0000000 1.000000  0.715005035 0.4516395          NA
## 19  0.0000000 1.000000  0.314199396 0.4644301          NA
## 20  0.0000000 1.000000  0.114803625 0.3189454          NA
## 21  0.0000000 1.000000  0.209466264 0.4071327          NA
## 22  0.0000000 1.000000  0.035246727 0.1844957          NA
## 23  0.0000000 1.000000  0.255790534 0.4365245          NA
## 24  0.0000000 1.000000  0.418932528 0.4936329          NA
## 25  0.0000000 1.000000  0.104733132 0.3063635          NA
## 26  0.0000000 1.000000  0.317220544 0.4656286          NA
## 27  0.0000000 1.000000  0.053373615 0.2248907          NA
## 28  0.0000000 1.000000  0.089627392 0.2857911          NA
## 29  0.0000000 1.000000  0.193353474 0.3951267          NA
## 30  0.0000000 1.000000  0.192346425 0.3943423          NA
## 31  0.0000000 1.000000  0.308157100 0.4619645          NA
## 32  0.0000000 1.000000  0.595166163 0.4911072          NA
## 33  0.0000000 1.000000  0.048338369 0.2145883          NA
## 34  0.0000000 1.000000  0.258811682 0.4382027          NA
## 35  0.0000000 1.000000  0.100704935 0.3010893          NA
## 36  0.0000000 1.000000  0.033232628 0.1793338          NA
## 37  0.0000000 1.000000  0.183282981 0.3870933          NA
## 38  0.0000000 1.000000  0.052366566 0.2228774          NA
## 39  0.0000000 1.000000  0.108761329 0.3114964          NA
## 40  0.0000000 1.000000  0.049345418 0.2166973          NA
## 41  0.0000000 1.000000  0.199395770 0.3997474          NA
## 42  0.0000000 1.000000  0.232628399 0.4227202          NA
## 43  0.0000000 1.000000  0.063444109 0.2438829          NA
## 44  0.0000000 1.000000  0.185297080 0.3887342          NA
## 45  0.0000000 1.000000  0.201409869 0.4012556          NA
## 46  0.0000000 1.000000  0.316213494 0.4652316          NA
## 47  0.0000000 1.000000  0.457200403 0.4984159          NA
## 48  0.0000000 1.000000  0.323262840 0.4679578          NA
## 49  0.0000000 1.000000  0.275931521 0.4472077          NA
## 50  0.0000000 1.000000  0.147029204 0.3543135          NA
## 51  0.0000000 1.000000  0.195367573 0.3966832          NA
## 52  0.0000000 1.000000  0.211480363 0.4085640          NA
## 53  0.0000000 1.000000  0.287009063 0.4525938          NA
## 54  0.0000000 1.000000  0.107754280 0.3102261          NA
## 55  0.0000000 1.000000  0.602215509 0.4896871          NA
## 56  0.0000000 1.000000  0.229607251 0.4207922          NA

2. Modeling

2-1. Run AutoML.

# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train), y)

auto_ml <- h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train,
    leaderboard_frame = h2o_validation,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)

## 
  |                                                                            
  |                                                                      |   0%
## 12:35:25.425: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 12:35:25.428: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 12 models.
## 12:36:22.957: StackedEnsemble_AllModels_AutoML_20210826_123525 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:37:29.811: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 12:37:29.828: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 23 models.
## 12:38:05.936: Skipping training of model GBM_5_AutoML_20210826_123729 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_123729.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 12:38:10.40: StackedEnsemble_BestOfFamily_AutoML_20210826_123729 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:38:11.54: StackedEnsemble_AllModels_AutoML_20210826_123729 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:04:16.622: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 33 models.
## 13:04:40.6: StackedEnsemble_BestOfFamily_AutoML_20210826_130416 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:04:41.11: StackedEnsemble_AllModels_AutoML_20210826_130416 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:05:15.851: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 13:05:15.853: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 43 models.
## 13:05:28.637: StackedEnsemble_BestOfFamily_AutoML_20210826_130515 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:05:29.646: StackedEnsemble_AllModels_AutoML_20210826_130515 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:06:08.231: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 13:06:08.234: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 53 models.
## 13:06:17.708: Skipping training of model GBM_5_AutoML_20210826_130608 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_130608.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 13:06:19.738: StackedEnsemble_BestOfFamily_AutoML_20210826_130608 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:06:20.743: StackedEnsemble_AllModels_AutoML_20210826_130608 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:08:23.609: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 63 models.
## 14:08:43.939: StackedEnsemble_BestOfFamily_AutoML_20210826_140823 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:08:44.947: StackedEnsemble_AllModels_AutoML_20210826_140823 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:09:30.900: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:09:30.902: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 73 models.
## 14:09:43.604: StackedEnsemble_BestOfFamily_AutoML_20210826_140930 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:09:44.610: StackedEnsemble_AllModels_AutoML_20210826_140930 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:10:32.891: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:10:32.892: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 83 models.
## 14:10:42.365: Skipping training of model GBM_5_AutoML_20210826_141032 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_141032.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 14:10:44.406: StackedEnsemble_BestOfFamily_AutoML_20210826_141032 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:10:45.415: StackedEnsemble_AllModels_AutoML_20210826_141032 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:24:38.327: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 93 models.
## 14:24:57.664: StackedEnsemble_BestOfFamily_AutoML_20210826_142438 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:24:58.672: StackedEnsemble_AllModels_AutoML_20210826_142438 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:25:56.530: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:25:56.531: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 103 models.
## 14:26:09.231: StackedEnsemble_BestOfFamily_AutoML_20210826_142556 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:26:10.238: StackedEnsemble_AllModels_AutoML_20210826_142556 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:27:11.872: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:27:11.874: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 113 models.
## 14:27:21.402: Skipping training of model GBM_5_AutoML_20210826_142711 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_142711.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 14:27:23.444: StackedEnsemble_BestOfFamily_AutoML_20210826_142711 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:27:24.451: StackedEnsemble_AllModels_AutoML_20210826_142711 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:40:41.11: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 123 models.
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |=====================                                                 |  30%
## 14:41:01.402: StackedEnsemble_BestOfFamily_AutoML_20210826_144041 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:41:02.410: StackedEnsemble_AllModels_AutoML_20210826_144041 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
  |                                                                            
  |======================================================================| 100%
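
The log above repeatedly reports "New models will be added to existing leaderboard Attrition@@Attrition", which suggests that earlier runs under the same `project_name` were still held by the cluster. One way to keep runs separate is sketched below. The helper `unique_project_name()` is hypothetical and not part of h2o, and the guarded call assumes the `h2o_train`, `h2o_validation`, `x`, and `y` objects defined above.

```r
# Hypothetical helper (not part of h2o): append a timestamp so that each
# AutoML run gets its own project name and therefore its own leaderboard.
unique_project_name <- function(prefix, now = Sys.time()) {
  paste0(prefix, "_", format(now, "%Y%m%d_%H%M%S"))
}

project <- unique_project_name("Attrition")
print(project)

# The AutoML call only runs when h2o is installed and the objects exist.
if (requireNamespace("h2o", quietly = TRUE) &&
    exists("h2o_train") && exists("h2o_validation") &&
    exists("x") && exists("y")) {
  auto_ml_fresh <- h2o::h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train,
    leaderboard_frame = h2o_validation,
    project_name = project,   # fresh name, so no older models are mixed in
    max_models = 10,
    seed = 1
  )
}
```

With a fresh project name the leaderboard should contain only the models from that single run, which makes the comparison in the next section easier to read.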

3. Model Comparison

3-1. Compare the models and pick the best one.

Because we ran AutoML rather than choosing a single h2o model, we compare all the models it trained.
By AUC, "StackedEnsemble_BestOfFamily" ranks highest.
A Stacked Ensemble uses the stacking process to find the best combination
of several base models such as XGBoost, GBM, and GLM.
As in the earlier hand-built comparison, GLM again outperforms RandomForest here.
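
The stacking idea can be sketched in base R. This is a toy illustration on simulated data, not how h2o builds its ensembles: the two base learners, the 5-fold split, and the logistic metalearner are all assumptions chosen to keep the example short.

```r
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - 1.0 * x2))
dat <- data.frame(y, x1, x2)

# Two deliberately weak base learners, each seeing only one feature.
base_formulas <- list(m1 = y ~ x1, m2 = y ~ x2)

# Step 1: out-of-fold predictions, so the metalearner never sees
# predictions made on rows the base learner was trained on.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))
oof  <- matrix(NA_real_, nrow = n, ncol = length(base_formulas),
               dimnames = list(NULL, names(base_formulas)))
for (f in seq_len(k)) {
  train_part <- dat[fold != f, ]
  held_out   <- dat[fold == f, ]
  for (m in names(base_formulas)) {
    fit <- glm(base_formulas[[m]], data = train_part, family = binomial())
    oof[fold == f, m] <- predict(fit, newdata = held_out, type = "response")
  }
}

# Step 2: the metalearner combines the base predictions.
meta_dat    <- data.frame(y = dat$y, oof)
metalearner <- glm(y ~ m1 + m2, data = meta_dat, family = binomial())
print(coef(metalearner))
```

The metalearner coefficients play the same role as the base-model contributions examined in section 3-2: a larger weight means the ensemble leans more on that base model.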

# Best models
best_models <- auto_ml@leaderboard

best_models %>% as.data.frame %>% DT::datatable()

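3-2. Evaluate each base model's contribution to the best model.

We evaluate how much each base model contributes to the Best of Family metalearner.
The plot shows that the ensemble was built by finding the best combination of GLM, DRF, XGBoost, and GBM.
GLM appears to suit the IBM dataset well.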

# Get the model IDs from the leaderboard, then retrieve the Best of Family ensemble.
best_model_id <- as.data.frame(best_models$model_id)[,1]

stacked_ensemble_model <- h2o.getModel(grep("StackedEnsemble_BestOfFamily", best_model_id, value=TRUE)[1])

metalearner <- h2o.getModel(stacked_ensemble_model@model$metalearner$name)

h2o.varimp_plot(metalearner)

3-3. Variable importance is not yet supported for the Stacked Ensemble Best of Family model.

3-4. Compare the variable importance of GLM and XGBoost.

Instead, we compare the variable importance values reported by the top GLM and the top XGBoost model.

# explainer <- lime(rnd_train,SEBOF)
# explain_top <- lime::explain(rnd_train[1:5],explainer, n_labels = 2, n_features = 10)
# plot_explanations(explain_top)



glm <- h2o.getModel(grep("GLM", best_model_id, value = TRUE)[1])
xgb <- h2o.getModel(grep("XGBoost", best_model_id, value = TRUE)[1])

# Examine the variable importance of the top GLM and the top XGBoost model.
# Unlike the stacked ensemble, these base models can report variable importance.
h2o.varimp(glm) %>% DT::datatable()

h2o.varimp(xgb)%>% DT::datatable()

The two methods produce different variable importance results.

# Plot the variable importance of each model.
h2o.varimp_plot(glm)

h2o.varimp_plot(xgb)
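
To quantify how much the two rankings disagree, the importance tables can be merged and compared by rank. The sketch below uses made-up values in two hypothetical data frames; it assumes tables shaped like `h2o.varimp()` output, with a `variable` column and a `scaled_importance` column.

```r
# Hypothetical importance tables; the numbers are invented for illustration.
vi_glm <- data.frame(
  variable = c("OverTime_Yes", "YearsInCurrentRole", "DistanceFromHome", "Age"),
  scaled_importance = c(1.00, 0.62, 0.40, 0.35)
)
vi_xgb <- data.frame(
  variable = c("Age", "OverTime_Yes", "DistanceFromHome", "YearsInCurrentRole"),
  scaled_importance = c(1.00, 0.80, 0.45, 0.30)
)

# Align the two tables on the variable name.
both <- merge(vi_glm, vi_xgb, by = "variable", suffixes = c("_glm", "_xgb"))
both$rank_glm <- rank(-both$scaled_importance_glm)
both$rank_xgb <- rank(-both$scaled_importance_xgb)

# Spearman correlation: 1 means identical ordering, -1 means reversed ordering.
rho <- cor(both$scaled_importance_glm, both$scaled_importance_xgb,
           method = "spearman")

print(both[order(both$rank_glm), ])
print(rho)
```

With real models, `as.data.frame(h2o.varimp(glm))` and `as.data.frame(h2o.varimp(xgb))` would take the place of the two hypothetical tables, provided their column names match the assumption above.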

3-5. Check the model's performance.

The AutoML leader performs far better than the models built by hand earlier.

h2o.performance(auto_ml@leader, h2o_test)->performance_automl

h2o.confusionMatrix(performance_automl)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.444758160347686:
##         No Yes    Error     Rate
## No     155   6 0.037267   =6/161
## Yes     12  23 0.342857   =12/35
## Totals 167  29 0.091837  =18/196
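
The metrics reported below all derive from the four cells of a confusion matrix. As a worked check, the following sketch recomputes them from the counts printed above (the max-F1 threshold), treating "Yes" as the positive class.

```r
# Counts copied from the confusion matrix above (positive class = "Yes").
tp <- 23   # actual Yes, predicted Yes
fn <- 12   # actual Yes, predicted No
fp <- 6    # actual No,  predicted Yes
tn <- 155  # actual No,  predicted No

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fn + fp + tn)

print(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy))
```

The overall error rate of 18/196 in the Totals row is simply 1 minus this accuracy. The values printed further down differ because they are evaluated near a threshold of 0.5 rather than at the max-F1 threshold.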

h2o.F1(performance_automl, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.6315789

h2o.accuracy(performance_automl, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.8928571

h2o.recall(performance_automl, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.5142857
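
The warnings above recommend applying the desired threshold directly to a probability column. A minimal sketch of that step follows. The data frame is a hypothetical stand-in for `as.data.frame(h2o.predict(...))`, on the assumption that a binary model returns one probability column per class, here named `Yes`.

```r
# Hypothetical predictions; the probabilities are invented for illustration.
pred <- data.frame(
  actual = c("Yes", "No", "No", "Yes", "No", "Yes"),
  Yes    = c(0.81, 0.12, 0.55, 0.47, 0.05, 0.50)
)

# Apply an exact 0.5 cut-off instead of the closest threshold h2o has stored.
threshold  <- 0.5
pred$label <- ifelse(pred$Yes >= threshold, "Yes", "No")

conf_mat <- table(
  actual    = factor(pred$actual, levels = c("No", "Yes")),
  predicted = factor(pred$label,  levels = c("No", "Yes"))
)
print(conf_mat)

recall_at_threshold <- conf_mat["Yes", "Yes"] / sum(conf_mat["Yes", ])
print(recall_at_threshold)
```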

h2o.auc(performance_automl)

## [1] 0.9110914

plot(performance_automl, type="roc")
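
AUC can be read as the probability that a randomly chosen leaver receives a higher score than a randomly chosen stayer. The sketch below computes it with the rank-sum formula on invented scores. It illustrates the definition only and is not necessarily how h2o computes the value internally; `auc_rank()` is a hypothetical helper.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic.
auc_rank <- function(score, is_positive) {
  r  <- rank(score)            # average ranks, so ties count as half
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Invented scores: one stayer (0.7) outranks one leaver (0.6).
score       <- c(0.9, 0.8, 0.7, 0.6, 0.2, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_value <- auc_rank(score, is_positive)
print(auc_value)
```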

3-6. Save the model so that it can be reused.

h2o.saveModel() writes the leader model to the working directory and returns its path.

model_path <- h2o.saveModel(auto_ml@leader, path=getwd(), force=TRUE)

model_path
## [1] "/Users/raymondkim/Rproject/Turnover/StackedEnsemble_BestOfFamily_AutoML_20210826_123312"
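
A saved model can later be read back with `h2o.loadModel()`. The sketch below wraps that in a hypothetical helper, `reload_and_score()`, and only runs when h2o is installed and the `model_path` and `h2o_test` objects from above exist. It also assumes a running cluster of the same H2O version that saved the model.

```r
# Hypothetical wrapper: reload a saved H2O model and score a frame with it.
# h2o.loadModel() and h2o.performance() are h2o functions; the wrapper is ours.
reload_and_score <- function(path, frame) {
  model <- h2o::h2o.loadModel(path)
  h2o::h2o.performance(model, newdata = frame)
}

# Guarded so that the block is harmless outside the original session.
if (requireNamespace("h2o", quietly = TRUE) &&
    exists("model_path") && exists("h2o_test")) {
  perf_reloaded <- reload_and_score(model_path, h2o_test)
  print(h2o::h2o.auc(perf_reloaded))
}
```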

Modeling4. HR Experience

1. Essence of people analytics

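1-1. Why are we trying to predict attrition?

$Y = aX+b$

The purpose of predictive people analytics is to find a changeable X that influences a Y we want to move in a positive direction.
The analyses so far, by contrast, put in every available X and aimed only at predicting well.
As discussed earlier, the outcomes we want to change through attrition prediction are likely the following.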

No	Y
1	Retention of key talent
2	Workforce planning
3	Pre-hire fit assessment
4	Education and training planning

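If the goal is to prevent key talent from leaving, rather than simply to predict each employee's probability of leaving,
we should consider which X can realistically be changed.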

No	X	Intervention
1	Years in Current Role	Internal transfer
2	Years with Current Manager	Internal transfer, manager reassignment
3	Over Time	Overtime limits, remote work
4	Work-Life Balance	Overtime limits, remote work
5	Environment Satisfaction	Improving the work environment
6	Distance From Home	Remote work, satellite offices
7	Business Travel	VR/video conferencing systems
8	Department	Internal transfer
9	Training Times Last Year	Establishing and running a training program
10	Years Since Last Promotion	Promotion

Based on this, we will rebuild the model.
For convenience, we use AutoML again.
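
To make the idea of a changeable X concrete, the following sketch fits a logistic regression on simulated data (not the IBM dataset) and asks a what-if question: how does the average predicted attrition probability move if overtime is removed? All variable names and coefficients are assumptions for illustration. Reading the result as a causal effect would require more than a predictive model can justify on its own.

```r
set.seed(42)
n <- 1000

# Simulated employees: overtime raises the odds of attrition by construction.
over_time     <- rbinom(n, 1, 0.3)
years_in_role <- rpois(n, 4)
attrition     <- rbinom(n, 1, plogis(-2 + 1.2 * over_time + 0.05 * years_in_role))
sim <- data.frame(attrition, over_time, years_in_role)

fit <- glm(attrition ~ over_time + years_in_role, data = sim, family = binomial())

# Baseline: predicted probabilities for the employees as observed.
baseline <- predict(fit, newdata = sim, type = "response")

# What-if: the same employees, with overtime set to 0 for everyone.
what_if <- predict(fit, newdata = transform(sim, over_time = 0), type = "response")

print(c(baseline = mean(baseline), no_overtime = mean(what_if)))
```

The gap between the two averages is the kind of quantity an intervention such as an overtime limit would aim to move.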

2. Preprocessing

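2-1. Rebuilding the dataset

We rebuild the dataset using only the changeable X variables identified above.
Note that BusinessTravel, although listed above, is not part of the selection below.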


Dataset <- readRDS("Dataset_pre.RDS")

Dataset %>% colnames()
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EnvironmentSatisfaction"  "Gender"                  
## [11] "HourlyRate"               "JobInvolvement"          
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyRate"              "NumCompaniesWorked"      
## [19] "OverTime"                 "PercentSalaryHike"       
## [21] "PerformanceRating"        "RelationshipSatisfaction"
## [23] "StockOptionLevel"         "TotalWorkingYears"       
## [25] "TrainingTimesLastYear"    "WorkLifeBalance"         
## [27] "YearsAtCompany"           "YearsInCurrentRole"      
## [29] "YearsSinceLastPromotion"  "YearsWithCurrManager"

Dataset %>% dplyr::select(Attrition,YearsInCurrentRole,OverTime, WorkLifeBalance, 
                          EnvironmentSatisfaction,DistanceFromHome, Department, 
                          TrainingTimesLastYear,YearsSinceLastPromotion,YearsWithCurrManager)->Dataset_HR

Dataset_HR  %>% mutate_if(is.character, factor)-> Data_HR

# Set the reference level of the outcome to "Yes" (relevel Data_HR itself, not Data_auto)
Data_HR$Attrition <- relevel(Data_HR$Attrition, ref = "Yes")

Data_HR %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe_re


h2o_recipe_re %>% juice -> Dataset_h2o_re

# Putting the original dataframe into an h2o format
Dataset_h2o_re %>% as.h2o(destination_frame = "h2o_df_re")->h2o_df_re
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

# Splitting into training, validation and testing sets
split_df_re <- h2o.splitFrame(h2o_df_re, c(0.7, 0.15), seed=12)

# Assign the three splits (70% / 15% / remaining 15%) to separate frames
h2o_train_re <- h2o.assign(split_df_re[[1]], "train")
h2o_validation_re <- h2o.assign(split_df_re[[2]], "validation")
h2o_test_re <- h2o.assign(split_df_re[[3]], "test")


h2o.describe(h2o_train_re)
##                                Label Type Missing Zeros PosInf NegInf
## 1                 YearsInCurrentRole real       0     0      0      0
## 2                   DistanceFromHome real       0     0      0      0
## 3              TrainingTimesLastYear real       0     0      0      0
## 4            YearsSinceLastPromotion real       0     0      0      0
## 5               YearsWithCurrManager real       0     0      0      0
## 6                          Attrition enum       0   838      0      0
## 7                       OverTime_Yes  int       0   719      0      0
## 8               WorkLifeBalance_Best  int       0   886      0      0
## 9             WorkLifeBalance_Better  int       0   395      0      0
## 10              WorkLifeBalance_Good  int       0   765      0      0
## 11       EnvironmentSatisfaction_Low  int       0   801      0      0
## 12    EnvironmentSatisfaction_Medium  int       0   802      0      0
## 13 EnvironmentSatisfaction_Very.High  int       0   687      0      0
## 14 Department_Research...Development  int       0   349      0      0
##           Min      Max         Mean     Sigma Cardinality
## 1  -1.1788794 3.826003 -0.012229704 0.9933579          NA
## 2  -1.0113326 2.484078  0.001687752 1.0064210          NA
## 3  -2.1847628 2.498781 -0.010428912 0.9911376          NA
## 4  -0.6928449 4.421544 -0.044992894 0.9661122          NA
## 5  -1.1627758 3.583973 -0.005964821 0.9948860          NA
## 6   0.0000000 1.000000  0.156092649 0.3631260           2
## 7   0.0000000 1.000000  0.275931521 0.4472077          NA
## 8   0.0000000 1.000000  0.107754280 0.3102261          NA
## 9   0.0000000 1.000000  0.602215509 0.4896871          NA
## 10  0.0000000 1.000000  0.229607251 0.4207922          NA
## 11  0.0000000 1.000000  0.193353474 0.3951267          NA
## 12  0.0000000 1.000000  0.192346425 0.3943423          NA
## 13  0.0000000 1.000000  0.308157100 0.4619645          NA
## 14  0.0000000 1.000000  0.648539778 0.4776669          NA
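
The describe output above contains only one Department dummy, even though a three-level factor normally yields two. One plausible explanation is that `step_corr()`, whose default threshold is 0.9, dropped the other dummy. The sketch below shows how two dummies from the same factor can be that strongly correlated when the reference level is rare. The department counts are hypothetical and chosen only for illustration.

```r
# Hypothetical department counts (n = 100); not taken from the dataset.
dept <- factor(rep(c("Human.Resources", "Research...Development", "Sales"),
                   times = c(4, 65, 31)))

# Treatment-coded dummies: the first level is the reference, as in step_dummy().
dummies <- model.matrix(~ dept)[, -1]
colnames(dummies) <- levels(dept)[-1]

# With a rare reference level, "not R&D" almost always means "Sales".
r <- cor(dummies)[1, 2]
print(r)
print(abs(r) > 0.9)
```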

2. Modeling

2-1. Run AutoML.

# Establish X and Y (Features and Labels)
y1 <- "Attrition"
x1 <- setdiff(names(h2o_train_re), y1)


automl <- h2o.automl(
    y = y1,
    x = x1,
    training_frame = h2o_train_re,
    validation_frame = h2o_validation_re,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)
## 
  |                                                                            
  |                                                                      |   0%
## 12:35:25.425: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 12:35:25.428: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 12 models.
## 12:36:22.957: StackedEnsemble_AllModels_AutoML_20210826_123525 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:37:29.811: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 12:37:29.828: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 23 models.
## 12:38:05.936: Skipping training of model GBM_5_AutoML_20210826_123729 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_123729.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 12:38:10.40: StackedEnsemble_BestOfFamily_AutoML_20210826_123729 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 12:38:11.54: StackedEnsemble_AllModels_AutoML_20210826_123729 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:04:16.622: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 33 models.
## 13:04:40.6: StackedEnsemble_BestOfFamily_AutoML_20210826_130416 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:04:41.11: StackedEnsemble_AllModels_AutoML_20210826_130416 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:05:15.851: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 13:05:15.853: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 43 models.
## 13:05:28.637: StackedEnsemble_BestOfFamily_AutoML_20210826_130515 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:05:29.646: StackedEnsemble_AllModels_AutoML_20210826_130515 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:06:08.231: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 13:06:08.234: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 53 models.
## 13:06:17.708: Skipping training of model GBM_5_AutoML_20210826_130608 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_130608.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 13:06:19.738: StackedEnsemble_BestOfFamily_AutoML_20210826_130608 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 13:06:20.743: StackedEnsemble_AllModels_AutoML_20210826_130608 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:08:23.609: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 63 models.
## 14:08:43.939: StackedEnsemble_BestOfFamily_AutoML_20210826_140823 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:08:44.947: StackedEnsemble_AllModels_AutoML_20210826_140823 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:09:30.900: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:09:30.902: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 73 models.
## 14:09:43.604: StackedEnsemble_BestOfFamily_AutoML_20210826_140930 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:09:44.610: StackedEnsemble_AllModels_AutoML_20210826_140930 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:10:32.891: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:10:32.892: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 83 models.
## 14:10:42.365: Skipping training of model GBM_5_AutoML_20210826_141032 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_141032.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 14:10:44.406: StackedEnsemble_BestOfFamily_AutoML_20210826_141032 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:10:45.415: StackedEnsemble_AllModels_AutoML_20210826_141032 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:24:38.327: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 93 models.
## 14:24:57.664: StackedEnsemble_BestOfFamily_AutoML_20210826_142438 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:24:58.672: StackedEnsemble_AllModels_AutoML_20210826_142438 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:25:56.530: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:25:56.531: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 103 models.
## 14:26:09.231: StackedEnsemble_BestOfFamily_AutoML_20210826_142556 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:26:10.238: StackedEnsemble_AllModels_AutoML_20210826_142556 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:27:11.872: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:27:11.874: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 113 models.
## 14:27:21.402: Skipping training of model GBM_5_AutoML_20210826_142711 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20210826_142711.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 158.0.
## 
## 14:27:23.444: StackedEnsemble_BestOfFamily_AutoML_20210826_142711 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:27:24.451: StackedEnsemble_AllModels_AutoML_20210826_142711 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:40:41.11: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 123 models.
## 14:41:01.402: StackedEnsemble_BestOfFamily_AutoML_20210826_144041 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:41:02.410: StackedEnsemble_AllModels_AutoML_20210826_144041 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:42:16.854: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 14:42:16.855: New models will be added to existing leaderboard Attrition@@Attrition (leaderboard frame=validation) with already 133 models.
## 14:42:29.663: StackedEnsemble_BestOfFamily_AutoML_20210826_144216 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
## 14:42:30.667: StackedEnsemble_AllModels_AutoML_20210826_144216 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.

2-2. AutoML results: GLM turned out to be the best model.

Ranked by AUC, GLM came out ahead of the Stacked Ensemble models as the best-performing model.

# Best models
best_models <- automl@leaderboard
best_models %>% as.data.frame %>% DT::datatable()

3. Model Evaluation

3-1. Evaluate the model based on F1 score and AUC.

h2o.performance(automl@leader, h2o_test_re)->performance_automl_re

h2o.confusionMatrix(performance_automl_re)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.261791635152933:
##         No Yes    Error     Rate
## No     146  15 0.093168  =15/161
## Yes     11  24 0.314286   =11/35
## Totals 157  39 0.132653  =26/196
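The metrics used in this section can be recomputed by hand from the confusion matrix above. The following is a minimal base R sketch that treats "Yes" (attrition) as the positive class; the variable names `tp`, `fn`, `fp`, and `tn` are introduced here for illustration only.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tp <- 24   # actual Yes, predicted Yes
fn <- 11   # actual Yes, predicted No
fp <- 15   # actual No,  predicted Yes
tn <- 146  # actual No,  predicted No

precision <- tp / (tp + fp)                                  # share of predicted leavers who left
recall    <- tp / (tp + fn)                                  # share of actual leavers who were caught
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy  <- (tp + tn) / (tp + fn + fp + tn)

round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 4)
```

The accuracy obtained this way is one minus the overall error rate (26/196) reported in the Totals row.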

h2o.F1(performance_automl_re, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.2

h2o.accuracy(performance_automl_re, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.8367347

h2o.recall(performance_automl_re, thresholds = .5)

## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.

## [[1]]
## [1] 0.1142857
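The warnings above recommend running `h2o.predict` and applying the desired threshold to a probability column. The sketch below shows that idea in base R. The vectors `p_yes` and `actual` are hypothetical stand-ins for the predicted "Yes" probabilities and the true labels, and `classify_at()` and `recall_at()` are helper names made up for this example.

```r
# Hypothetical predicted probabilities of "Yes" and true labels (illustrative only)
p_yes  <- c(0.91, 0.62, 0.48, 0.33, 0.27, 0.15, 0.08, 0.05)
actual <- c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No")

# Turn probabilities into class labels at a chosen cut-off
classify_at <- function(prob, threshold) {
  ifelse(prob >= threshold, "Yes", "No")
}

# Recall = correctly flagged leavers / all actual leavers
recall_at <- function(prob, truth, threshold) {
  pred <- classify_at(prob, threshold)
  sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
}

recall_at(p_yes, actual, 0.50)  # strict cut-off
recall_at(p_yes, actual, 0.26)  # lower cut-off, close to the max-F1 threshold reported above
```

With imbalanced attrition data, a cut-off of 0.5 tends to miss many leavers, which is consistent with the low recall printed above at that threshold.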

h2o.auc(performance_automl_re)

## [1] 0.8436557
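AUC can be read as the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen stayer. The base R sketch below computes it from ranks (the Mann-Whitney formulation). The data are hypothetical, and `auc_rank()` is a helper written for this example rather than an h2o function.

```r
# AUC from ranks: (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_rank <- function(prob, truth, positive = "Yes") {
  r     <- rank(prob)                # average ranks, so ties are handled
  n_pos <- sum(truth == positive)
  n_neg <- sum(truth != positive)
  (sum(r[truth == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels (illustrative only)
p_yes  <- c(0.91, 0.62, 0.48, 0.33, 0.27, 0.15, 0.08, 0.05)
actual <- c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No")

auc_rank(p_yes, actual)
```

In this toy data, 14 of the 15 leaver/stayer pairs are ordered correctly, so the function returns 14/15.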

plot(performance_automl_re, type="roc")

3-2. Check the Variable Importance.

The Variable Importance values show that Years in Current Role and OverTime_Yes have the largest influence.

# Retrieve the best model.
best_model_id2 <- as.data.frame(best_models$model_id)[,1]
glm_re <- h2o.getModel(grep("GLM", best_model_id2, value = TRUE)[1])
h2o.varimp(glm_re) %>% DT::datatable()

4. Group level

4-1. Separate out the high-performer group

Even within the high-performer group (Performance Rating of 'Outstanding'), the ratio of
leavers to stayers is similar to that of the original data.

Dataset <- readRDS("Dataset_pre.RDS")

Dataset %>% dplyr::select(PerformanceRating) %>% unique

## # A tibble: 2 x 1
##   PerformanceRating
##   <chr>            
## 1 Excellent        
## 2 Outstanding

Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% nrow

## [1] 212

Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% dplyr::select(Attrition) %>% table

## .
##  No Yes 
## 175  37
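To put a number on that ratio, the counts in the table above can be converted to proportions. This is a small base R sketch; `counts` is a name introduced here for illustration.

```r
# Attrition counts for the "Outstanding" group, copied from the table above
counts <- c(No = 175, Yes = 37)

# Share of stayers and leavers among the 212 high performers
round(prop.table(counts), 3)
```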

# The class composition (diversity) is similar to the original data

Dataset %>% filter(PerformanceRating %in% "Outstanding")->Dataset_High

Dataset_High  %>% mutate_if(is.character, factor)-> Data_High

4-2. As before, extract the data around the X variables that can actually be changed.

# Setting Reference level
Data_High$Attrition <- relevel(Data_High$Attrition, ref = "Yes")

Data_High %>% dplyr::select(-PerformanceRating) %>% 
  dplyr::select(Attrition,YearsInCurrentRole,OverTime, WorkLifeBalance, 
                          EnvironmentSatisfaction,DistanceFromHome, Department, 
                          TrainingTimesLastYear,YearsSinceLastPromotion,YearsWithCurrManager)->Data_High

Data_High %>% recipe(Attrition~.) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_corr(all_numeric()) %>% prep()-> h2o_recipe_High


h2o_recipe_High %>% juice -> Dataset_h2o_High
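The recipe above standardizes numeric predictors with `step_normalize()` and converts nominal predictors into 0/1 indicator columns with `step_dummy()`. The base R sketch below reproduces those two transformations on a hypothetical toy data frame (`toy` is not the author's dataset), to show what the processed columns look like. Note that `model.matrix()` names the indicator `OverTimeYes`, whereas recipes would name it `OverTime_Yes`.

```r
# Hypothetical toy data (illustrative only)
toy <- data.frame(
  YearsInCurrentRole = c(0, 2, 4, 7, 12),
  OverTime           = factor(c("No", "Yes", "No", "Yes", "No"))
)

# z-score: subtract the mean and divide by the standard deviation,
# which is what step_normalize() does to a numeric column
toy$YearsInCurrentRole <- as.numeric(scale(toy$YearsInCurrentRole))

# Treatment-coded dummies with the first level dropped,
# comparable to the default behaviour of step_dummy()
dummies <- model.matrix(~ OverTime, data = toy)[, -1, drop = FALSE]

toy_encoded <- cbind(toy["YearsInCurrentRole"], dummies)
toy_encoded
```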

# Putting the original dataframe into an h2o format
Dataset_h2o_High %>% as.h2o(destination_frame = "h2o_df")->h2o_df_High


# Splitting into training, validation and testing sets
split_df_High <- h2o.splitFrame(h2o_df_High, c(0.7, 0.15), seed=12)

# Obtaining our three types of sets into three separate values
h2o_train_High <- h2o.assign(split_df_High[[1]], "train")
h2o_validation_High <- h2o.assign(split_df_High[[2]], "validation")
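# The third split is the held-out test set; reusing split_df_High[[2]] here would
# evaluate on the same rows as the validation frame.
h2o_test_High <- h2o.assign(split_df_High[[3]], "test")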
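With two ratios, `h2o.splitFrame()` returns three frames: 70%, 15%, and the remaining 15%, assigned probabilistically rather than as exact counts. The base R sketch below mimics that three-way split on row indices to show that the validation and test sets should not share rows. The names `grp` and `idx` are illustrative, and the resulting group sizes will not match the H2O split.

```r
set.seed(12)
n <- 212  # number of rows in the high-performer subset

# Assign each row to one of three groups with probabilities 0.70 / 0.15 / 0.15
grp <- sample(c("train", "validation", "test"), size = n,
              replace = TRUE, prob = c(0.70, 0.15, 0.15))
idx <- split(seq_len(n), grp)

lengths(idx)                                  # approximate 70/15/15 split
length(intersect(idx$validation, idx$test))   # 0: no row is shared
```

Keeping the test rows separate from the validation rows matters here because the AutoML leaderboard is ranked on the validation frame.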


h2o.describe(h2o_train_High)

##                                Label Type Missing Zeros PosInf NegInf
## 1                 YearsInCurrentRole real       0     0      0      0
## 2                   DistanceFromHome real       0     0      0      0
## 3              TrainingTimesLastYear real       0     0      0      0
## 4            YearsSinceLastPromotion real       0     0      0      0
## 5               YearsWithCurrManager real       0     0      0      0
## 6                          Attrition enum       0   135      0      0
## 7                       OverTime_Yes  int       0   110      0      0
## 8               WorkLifeBalance_Best  int       0   140      0      0
## 9             WorkLifeBalance_Better  int       0    62      0      0
## 10              WorkLifeBalance_Good  int       0   124      0      0
## 11       EnvironmentSatisfaction_Low  int       0   129      0      0
## 12    EnvironmentSatisfaction_Medium  int       0   127      0      0
## 13 EnvironmentSatisfaction_Very.High  int       0   114      0      0
## 14 Department_Research...Development  int       0    48      0      0
## 15                  Department_Sales  int       0   116      0      0
##           Min      Max        Mean     Sigma Cardinality
## 1  -1.1954805 3.396897 -0.01832029 0.9943792          NA
## 2  -1.0107498 2.250017 -0.03340425 0.9781307          NA
## 3  -2.1589112 2.583982 -0.02761118 1.0003229          NA
## 4  -0.6732738 3.673956  0.01916470 0.9943281          NA
## 5  -1.2019820 3.242556 -0.01114584 0.9985143          NA
## 6   0.0000000 1.000000  0.14556962 0.3537956           2
## 7   0.0000000 1.000000  0.30379747 0.4613586          NA
## 8   0.0000000 1.000000  0.11392405 0.3187292          NA
## 9   0.0000000 1.000000  0.60759494 0.4898387          NA
## 10  0.0000000 1.000000  0.21518987 0.4122607          NA
## 11  0.0000000 1.000000  0.18354430 0.3883430          NA
## 12  0.0000000 1.000000  0.19620253 0.3983862          NA
## 13  0.0000000 1.000000  0.27848101 0.4496767          NA
## 14  0.0000000 1.000000  0.69620253 0.4613586          NA
## 15  0.0000000 1.000000  0.26582278 0.4431750          NA

4-3. Run AutoML.

# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train_High), y)

automl_high <- h2o.automl(
    y = y,
    x = x,
    training_frame = h2o_train_High,
    validation_frame = h2o_validation_High,
    project_name = "Attrition",
    max_models = 10,
    seed = 1
)

4-3. AutoML results: DRF turned out to be the best model.

Ranked by AUC, DRF came out ahead of the Stacked Ensemble models as the best-performing model.

# Best models
best_models_High <- automl_high@leaderboard
best_models_High %>% as.data.frame %>% DT::datatable()

4-4. Check the performance.

h2o.performance(automl_high@leader, h2o_test_High)->performance_automl_High

# 
# max f1 @ threshold = 0.479166666666667:

h2o.confusionMatrix(performance_automl_High)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.479166666666667:
##        No Yes    Error   Rate
## No     20   0 0.000000  =0/20
## Yes     0   7 0.000000   =0/7
## Totals 20   7 0.000000  =0/27

h2o.F1(performance_automl_High)

##      threshold        f1
## 1  0.873958333 0.2500000
## 2  0.770833333 0.4444444
## 3  0.729166667 0.6000000
## 4  0.645833333 0.7272727
## 5  0.583333333 0.8333333
## 6  0.500000000 0.9230769
## 7  0.479166667 1.0000000
## 8  0.423611111 0.9333333
## 9  0.187500000 0.8750000
## 10 0.135416667 0.7777778
## 11 0.130555555 0.7368421
## 12 0.114583333 0.7000000
## 13 0.062500000 0.6666667
## 14 0.041666667 0.6086957
## 15 0.028472222 0.5833333
## 16 0.020833333 0.4827586
## 17 0.010416667 0.4666667
## 18 0.006944444 0.4516129
## 19 0.000000000 0.4117647
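The "max f1" threshold reported with the confusion matrix is simply the row of this table with the largest F1. As a minimal base R sketch, with the values copied from the output above and `f1_tbl` being a name introduced for illustration:

```r
# Threshold / F1 pairs copied from the h2o.F1() output above
f1_tbl <- data.frame(
  threshold = c(0.873958333, 0.770833333, 0.729166667, 0.645833333, 0.583333333,
                0.500000000, 0.479166667, 0.423611111, 0.187500000, 0.135416667,
                0.130555555, 0.114583333, 0.062500000, 0.041666667, 0.028472222,
                0.020833333, 0.010416667, 0.006944444, 0.000000000),
  f1        = c(0.2500000, 0.4444444, 0.6000000, 0.7272727, 0.8333333,
                0.9230769, 1.0000000, 0.9333333, 0.8750000, 0.7777778,
                0.7368421, 0.7000000, 0.6666667, 0.6086957, 0.5833333,
                0.4827586, 0.4666667, 0.4516129, 0.4117647)
)

# Row with the highest F1 score
best <- f1_tbl[which.max(f1_tbl$f1), ]
best
```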

h2o.accuracy(performance_automl_High)

##      threshold  accuracy
## 1  0.873958333 0.7777778
## 2  0.770833333 0.8148148
## 3  0.729166667 0.8518519
## 4  0.645833333 0.8888889
## 5  0.583333333 0.9259259
## 6  0.500000000 0.9629630
## 7  0.479166667 1.0000000
## 8  0.423611111 0.9629630
## 9  0.187500000 0.9259259
## 10 0.135416667 0.8518519
## 11 0.130555555 0.8148148
## 12 0.114583333 0.7777778
## 13 0.062500000 0.7407407
## 14 0.041666667 0.6666667
## 15 0.028472222 0.6296296
## 16 0.020833333 0.4444444
## 17 0.010416667 0.4074074
## 18 0.006944444 0.3703704
## 19 0.000000000 0.2592593

h2o.recall(performance_automl_High)

##      threshold       tpr
## 1  0.873958333 0.1428571
## 2  0.770833333 0.2857143
## 3  0.729166667 0.4285714
## 4  0.645833333 0.5714286
## 5  0.583333333 0.7142857
## 6  0.500000000 0.8571429
## 7  0.479166667 1.0000000
## 8  0.423611111 1.0000000
## 9  0.187500000 1.0000000
## 10 0.135416667 1.0000000
## 11 0.130555555 1.0000000
## 12 0.114583333 1.0000000
## 13 0.062500000 1.0000000
## 14 0.041666667 1.0000000
## 15 0.028472222 1.0000000
## 16 0.020833333 1.0000000
## 17 0.010416667 1.0000000
## 18 0.006944444 1.0000000
## 19 0.000000000 1.0000000

h2o.precision(performance_automl_High)

##      threshold precision
## 1  0.873958333 1.0000000
## 2  0.770833333 1.0000000
## 3  0.729166667 1.0000000
## 4  0.645833333 1.0000000
## 5  0.583333333 1.0000000
## 6  0.500000000 1.0000000
## 7  0.479166667 1.0000000
## 8  0.423611111 0.8750000
## 9  0.187500000 0.7777778
## 10 0.135416667 0.6363636
## 11 0.130555555 0.5833333
## 12 0.114583333 0.5384615
## 13 0.062500000 0.5000000
## 14 0.041666667 0.4375000
## 15 0.028472222 0.4117647
## 16 0.020833333 0.3181818
## 17 0.010416667 0.3043478
## 18 0.006944444 0.2916667
## 19 0.000000000 0.2592593
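The F1 table can be cross-checked against the recall and precision tables, since F1 is the harmonic mean of the two. Below is a short base R sketch using two rows from the output above; `f1_from()` is a helper defined only for this example.

```r
# F1 as the harmonic mean of precision and recall
f1_from <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Row 8 (threshold 0.4236): precision 0.875, recall 1
f1_from(0.875, 1)

# Row 5 (threshold 0.5833): precision 1, recall 5/7 (0.7142857)
f1_from(1, 5 / 7)
```

Both results agree with the corresponding F1 values printed earlier (0.9333333 and 0.8333333).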

h2o.auc(performance_automl_High)

## [1] 1

plot(performance_automl_High, type="roc")

4-5. Check the Variable Importance.

The DRF Variable Importance shows that, for high performers, OverTime_Yes and Years Since Last Promotion
have the strongest influence.

# Retrieve the best model.
best_model_id_High <- as.data.frame(best_models_High$model_id)[,1]

DRF_re <- h2o.getModel(grep("DRF", best_model_id_High, value = TRUE)[1])

h2o.varimp(DRF_re) %>% DT::datatable()

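Splitting the analysis at the group level surfaces a different set of important variables.
The next step is to interpret these results in light of the organization's situation and context, then plan and carry out a matching intervention!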

h2o.shutdown()

## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

1. https://hbr.org/2019/08/better-ways-to-predict-whos-going-to-quit
2. https://www.etoday.co.kr/news/view/1747355
3. Speer, A. B. (2021). Empirical attrition modelling and discrimination: Balancing validity and group differences. Human Resource Management Journal.
4. Gibson, C., Koenig, N., Griffith, J., & Hardy, J. H. (2019). Selecting for retention: Understanding turnover prehire. Industrial and Organizational Psychology, 12(3), 338-341.
5. McCloy, R. A., Smith, E. A., & Anderson, M. G. (2016). Predicting voluntary turnover from engagement data. In 31st Annual Conference of the Society for Industrial & Organizational Psychology, Anaheim, CA.
6. Speer, A. B., Dutta, S., Chen, M., & Trussell, G. (2019). Here to stay or go? Connecting turnover research to applied attrition modeling. Industrial and Organizational Psychology, 12(3), 277-301.
7. Strickland, W. J. (2005). A longitudinal examination of first term attrition and reenlistment among FY1999 enlisted accessions. Human Resources Research Organization, Alexandria, VA.
8. Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39(3), 3446-3453.
9. 박재신, & 방성완. (2015). 불균형 자료의 분류분석에서 샘플링 기법을 이용한 로지스틱 회귀분석. Journal of The Korean Data Analysis Society, 17(4), 1877-1888.

Turnover_Modeling

yuaye.kt@gmail.com

2021-08-26
