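I am Kwangtae Kim (김광태). I have worked on a wide range of HR tasks in the HR office of H Group,
and I analyze HR data such as personnel records, organizational culture surveys, and collaboration networks.
The courses and materials I took from 2019 through early 2020 to build my analytics skills, along with HR-related analysis techniques,
are collected at the link below.
Data analytics is advancing by the day and new techniques keep being introduced, so I study continuously.
I always welcome anyone who wants to share analysis know-how, as well as questions about what I have posted.
Email me at yuaye.kt@gmail.com and I will reply :)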
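Attrition is a topic HR has always cared about.
Over the past several years, open innovation has blurred the boundaries of talent pools across industries and companies,
and, particularly in software and biotech, talent attraction for top talent and
attrition/turnover management for key personnel have grown in importance. [1]
Attrition modeling, which predicts who will leave, has been studied continuously, mainly in advanced economies
such as the United States where talent-pool boundaries between industries were already blurred. The chart below shows that,
in the management field, research on attrition/turnover has kept increasing.
Many companies already analyze attrition/turnover and build predictive models,
which they use for employee retention, talent attraction, and similar purposes. [2]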
data.frame(x=2000:2021, y=c(1,1,14,10,17,20,22,31,27,36,34,56,47,50,59,54,72,97,85,112,115,75)) %>%
ggplot(aes(x, y, label=paste0(y," papers")))+geom_line()+theme_bw()+theme(plot.caption= element_text(hjust= 0))+
theme(axis.line = element_line(size=1), axis.ticks = element_line(size=1),panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank())+geom_point()+ggrepel::geom_text_repel(size=3)+labs( x="publication year",y= "Number of publications in the field of management", caption = "Note. Based on Web of Science; number of papers containing the Attrition/Turnover keyword")
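Many open-source examples exist for Python and R,
but most of them focus only on the analysis technique itself, such as hyperparameter tuning,
so I added theoretical background and HR work experience to the modeling.
This site summarizes a basic analysis built on three models.
HR-experience-based modeling and preprocessing steps that cannot be shown in a single pass are not included.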
Attrition models combine variables that predict turnover into statistical algorithms that then estimate the probability of employee turnover within a given timeframe, or at a specific timepoint;
Drawing on several prior studies, four purposes of attrition modeling have been proposed. In summary:
1. pre-employment selection [4]
2. validate and develop training initiatives [5]
3. facilitate workforce planning discussions with specific parts of the company [6]
4. create ad hoc programs to reduce attrition [7]
Attrition modeling thus offers a range of implications for strategic HR, including hiring decisions, initiatives for workforce adjustment and development planning,
and the design of interventions to reduce employee attrition.
Purpose of Attrition Modeling: The formed attrition estimates can then serve a number of purposes, including use for pre-employment selection (Gibson et al., 2019; Strickland, 2005), to validate and develop training initiatives (McCloy et al., 2016; Strickland, 2005), to facilitate workforce planning discussions with specific parts of the company (Speer et al., 2019), to create ad hoc programs to reduce attrition (Strickland, 2005) and a variety of other HR purposes generally aimed at understanding and impacting employee turnover. The work is conducted both internally and by external vendors as well. For example, HR software companies currently offer features that include projected group-level turnover estimates within HR dashboards, as well as risk projections for individual employees. These are often accompanied by in-depth studies into the root causes of turnover, which then facilitate turnover interventions. Thus, attrition models serve various strategic HR purposes.
We build an attrition model
based on the IBM HR Analytics Dataset published on Kaggle.
I followed the tidyverse ecosystem and tried to keep the code as tidy as possible,
and I tried to show the flow in which I have carried out People Analytics work so far.
The modeling follows the process below.
No | Process | R Packages |
1 | Literature Review | |
2 | Data Import | tidyverse |
3 | Tidy data + Transformation, Pre-Processing | tidyverse |
4 | Visualization for EDA, Feature Engineering | dlookr, ExPanDaR, tidyverse |
5 | Modeling(1) Logistic Regression | tidymodels |
6 | Modeling(2) RandomForest | tidymodels, randomForest |
7 | Modeling(3) AutoML | h2o |
8 | Reporting | bookdown |
## [1] "/Users/raymondkim/Rproject/Turnover"
# Import the data as a tibble with read_csv.
Dataset <- read_csv("archive/Data.csv")
##
## ─ Column specification ────────────────────────────
## cols(
## .default = col_double(),
## Attrition = col_character(),
## BusinessTravel = col_character(),
## Department = col_character(),
## EducationField = col_character(),
## Gender = col_character(),
## JobRole = col_character(),
## MaritalStatus = col_character(),
## Over18 = col_character(),
## OverTime = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Check that the data was imported correctly.
Dataset %>% glimpse
## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
# Check which variables the data contains.
Dataset %>% diagnose() %>% arrange(unique_count)
## # A tibble: 35 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 EmployeeCount numeric 0 0 1 0.000680
## 2 Over18 charac… 0 0 1 0.000680
## 3 StandardHours numeric 0 0 1 0.000680
## 4 Attrition charac… 0 0 2 0.00136
## 5 Gender charac… 0 0 2 0.00136
## 6 OverTime charac… 0 0 2 0.00136
## 7 PerformanceRa… numeric 0 0 2 0.00136
## 8 BusinessTravel charac… 0 0 3 0.00204
## 9 Department charac… 0 0 3 0.00204
## 10 MaritalStatus charac… 0 0 3 0.00204
## # … with 25 more rows
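The constant columns that show up with unique_count = 1 can also be located programmatically rather than read off the table. The following is a minimal base R sketch on a made-up toy data frame (the values and the names `toy`, `constant_cols`, and `toy_clean` are illustrative assumptions, not part of the original analysis):

```r
# Toy data frame standing in for the HR data (values are made up).
toy <- data.frame(
  Age           = c(41, 49, 37, 33),
  EmployeeCount = c(1, 1, 1, 1),
  Over18        = c("Y", "Y", "Y", "Y"),
  OverTime      = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Count distinct values per column; a column with a single distinct value
# cannot help separate leavers from stayers.
unique_count  <- vapply(toy, function(col) length(unique(col)), integer(1))
constant_cols <- names(unique_count)[unique_count == 1]

# Drop the constant columns.
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]

print(constant_cols)
print(names(toy_clean))
```

Identifier-like columns such as EmployeeNumber are not caught by this check, because every value is distinct; they still have to be removed by judgment.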
# Remove the variables with unique_count = 1 (Over18, EmployeeCount, StandardHours)
# and the uninformative identifier (EmployeeNumber).
Dataset %>% dplyr::select(-Over18, -EmployeeCount, -StandardHours, -EmployeeNumber) -> Dataset
Dataset %>% diagnose_category()
## # A tibble: 30 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 1470 1233 83.9 1
## 2 Attrition Yes 1470 237 16.1 2
## 3 BusinessTravel Travel_Rarely 1470 1043 71.0 1
## 4 BusinessTravel Travel_Frequently 1470 277 18.8 2
## 5 BusinessTravel Non-Travel 1470 150 10.2 3
## 6 Department Research & Development 1470 961 65.4 1
## 7 Department Sales 1470 446 30.3 2
## 8 Department Human Resources 1470 63 4.29 3
## 9 EducationField Life Sciences 1470 606 41.2 1
## 10 EducationField Medical 1470 464 31.6 2
## # … with 20 more rows
Dataset %>% diagnose_numeric()
## # A tibble: 23 x 10
## variables min Q1 mean median Q3 max zero minus outlier
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Age 18 30 3.69e1 36 43 60 0 0 0
## 2 DailyRate 102 465 8.02e2 802 1157 1499 0 0 0
## 3 DistanceFromHome 1 2 9.19e0 7 14 29 0 0 0
## 4 Education 1 2 2.91e0 3 4 5 0 0 0
## 5 EnvironmentSatisf… 1 2 2.72e0 3 4 4 0 0 0
## 6 HourlyRate 30 48 6.59e1 66 83.8 100 0 0 0
## 7 JobInvolvement 1 2 2.73e0 3 3 4 0 0 0
## 8 JobLevel 1 1 2.06e0 2 3 5 0 0 0
## 9 JobSatisfaction 1 2 2.73e0 3 4 4 0 0 0
## 10 MonthlyIncome 1009 2911 6.50e3 4919 8379 19999 0 0 114
## # … with 13 more rows
# Education, PerformanceRating, RelationshipSatisfaction, WorkLifeBalance, JobLevel,
# StockOptionLevel, and NumCompaniesWorked turn out to be categorical variables.
# They need to be converted to factors so that their meaning is easy to read in later analysis.
#1) Education
gsub(1, 'below College',Dataset$Education) -> Dataset$Education
gsub(2, 'College',Dataset$Education) -> Dataset$Education
gsub(3, 'Bachelor',Dataset$Education) -> Dataset$Education
gsub(4, 'Master',Dataset$Education) -> Dataset$Education
gsub(5, 'Doctor',Dataset$Education) -> Dataset$Education
Dataset$Education %>% as.factor %>% unique
## [1] College below College Master Bachelor Doctor
## Levels: Bachelor below College College Doctor Master
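Note that after the `gsub()` recoding the factor levels are sorted alphabetically (Bachelor, below College, College, ...), so the ordinal meaning of the original 1 to 5 codes is lost. One alternative, sketched here on made-up codes and not what the original code does, is to map codes to labels in a single `factor()` call that fixes the level order:

```r
# Made-up education codes on the 1-5 scale used in the data.
education_code <- c(2, 1, 4, 3, 5, 3)

# Map codes to labels in one step and keep the natural order of the scale.
education <- factor(
  education_code,
  levels  = 1:5,
  labels  = c("below College", "College", "Bachelor", "Master", "Doctor"),
  ordered = TRUE
)

print(levels(education))
print(table(education))
```

An ordered factor changes how base R builds contrasts in model matrices (polynomial contrasts by default), so `ordered = FALSE` may be preferable when only the display order of the levels matters.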
#2) Performance Rating
gsub(1, 'Low',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(2, 'Good',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(3, 'Excellent',Dataset$PerformanceRating) -> Dataset$PerformanceRating
gsub(4, 'Outstanding',Dataset$PerformanceRating) -> Dataset$PerformanceRating
Dataset$PerformanceRating %>% as.factor %>% unique
## [1] Excellent Outstanding
## Levels: Excellent Outstanding
#3) WorklifeBalance
gsub(1, 'Bad',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(2, 'Good',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(3, 'Better',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
gsub(4, 'Best',Dataset$WorkLifeBalance) -> Dataset$WorkLifeBalance
Dataset$WorkLifeBalance %>% as.factor %>% unique
## [1] Bad Better Good Best
## Levels: Bad Best Better Good
#4) JobInvolvement
gsub(1, 'Low',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(2, 'Medium',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(3, 'High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
gsub(4, 'Very High',Dataset$JobInvolvement) -> Dataset$JobInvolvement
Dataset$JobInvolvement %>% as.factor %>% unique
## [1] High Medium Very High Low
## Levels: High Low Medium Very High
#5) EnvironmentSatisfaction
gsub(1, 'Low',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(2, 'Medium',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(3, 'High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
gsub(4, 'Very High',Dataset$EnvironmentSatisfaction) -> Dataset$EnvironmentSatisfaction
Dataset$EnvironmentSatisfaction %>% as.factor %>% unique
## [1] Medium High Very High Low
## Levels: High Low Medium Very High
#6) JobSatisfaction
gsub(1, 'Low',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(2, 'Medium',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(3, 'High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
gsub(4, 'Very High',Dataset$JobSatisfaction) -> Dataset$JobSatisfaction
Dataset$JobSatisfaction %>% as.factor %>% unique
## [1] Very High Medium High Low
## Levels: High Low Medium Very High
#7) RelationshipSatisfaction
gsub(1, 'Low',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(2, 'Medium',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(3, 'High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
gsub(4, 'Very High',Dataset$RelationshipSatisfaction) -> Dataset$RelationshipSatisfaction
Dataset$RelationshipSatisfaction %>% as.factor %>% unique
## [1] Low Very High Medium High
## Levels: High Low Medium Very High
# Check the number of character and numeric variables
Dataset %>% diagnose() %>% dplyr::select(types) %>% table
## .
## character numeric
## 15 16
# Missing value check
Dataset %>% naniar::gg_miss_var()
# Save the dataset after the first round of transformation.
saveRDS(Dataset, "Dataset.RDS")
# Check the categorical variables
Dataset %>% diagnose_category()
## # A tibble: 57 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 1470 1233 83.9 1
## 2 Attrition Yes 1470 237 16.1 2
## 3 BusinessTravel Travel_Rarely 1470 1043 71.0 1
## 4 BusinessTravel Travel_Frequently 1470 277 18.8 2
## 5 BusinessTravel Non-Travel 1470 150 10.2 3
## 6 Department Research & Development 1470 961 65.4 1
## 7 Department Sales 1470 446 30.3 2
## 8 Department Human Resources 1470 63 4.29 3
## 9 Education Bachelor 1470 572 38.9 1
## 10 Education Master 1470 398 27.1 2
## # … with 47 more rows
# Check the numeric variables
Dataset %>% diagnose_numeric()
## # A tibble: 16 x 10
## variables min Q1 mean median Q3 max zero minus outlier
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Age 18 30 3.69e+1 36 4.3 e1 60 0 0 0
## 2 DailyRate 102 465 8.02e+2 802 1.16e3 1499 0 0 0
## 3 DistanceFromHome 1 2 9.19e+0 7 1.4 e1 29 0 0 0
## 4 HourlyRate 30 48 6.59e+1 66 8.38e1 100 0 0 0
## 5 JobLevel 1 1 2.06e+0 2 3 e0 5 0 0 0
## 6 MonthlyIncome 1009 2911 6.50e+3 4919 8.38e3 19999 0 0 114
## 7 MonthlyRate 2094 8047 1.43e+4 14236. 2.05e4 26999 0 0 0
## 8 NumCompaniesWor… 0 1 2.69e+0 2 4 e0 9 197 0 52
## 9 PercentSalaryHi… 11 12 1.52e+1 14 1.8 e1 25 0 0 0
## 10 StockOptionLevel 0 0 7.94e-1 1 1 e0 3 631 0 85
## 11 TotalWorkingYea… 0 6 1.13e+1 10 1.5 e1 40 11 0 63
## 12 TrainingTimesLa… 0 2 2.80e+0 3 3 e0 6 54 0 238
## 13 YearsAtCompany 0 3 7.01e+0 5 9 e0 40 44 0 104
## 14 YearsInCurrentR… 0 2 4.23e+0 3 7 e0 18 244 0 21
## 15 YearsSinceLastP… 0 0 2.19e+0 1 3 e0 15 581 0 107
## 16 YearsWithCurrMa… 0 2 4.12e+0 3 7 e0 17 263 0 14
# Sort the variables by number of outliers, largest first
Dataset %>% diagnose_outlier() %>% arrange(desc(outliers_cnt))
## # A tibble: 16 x 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 TrainingTim… 238 16.2 4.14 2.80 2.54
## 2 MonthlyInco… 114 7.76 18400. 6503. 5503.
## 3 YearsSinceL… 107 7.28 11.1 2.19 1.48
## 4 YearsAtComp… 104 7.07 23.5 7.01 5.75
## 5 StockOption… 85 5.78 3 0.794 0.658
## 6 TotalWorkin… 63 4.29 32.6 11.3 10.3
## 7 NumCompanie… 52 3.54 9 2.69 2.46
## 8 YearsInCurr… 21 1.43 16 4.23 4.06
## 9 YearsWithCu… 14 0.952 16.1 4.12 4.01
## 10 Age 0 0 NaN 36.9 36.9
## 11 DailyRate 0 0 NaN 802. 802.
## 12 DistanceFro… 0 0 NaN 9.19 9.19
## 13 HourlyRate 0 0 NaN 65.9 65.9
## 14 JobLevel 0 0 NaN 2.06 2.06
## 15 MonthlyRate 0 0 NaN 14313. 14313.
## 16 PercentSala… 0 0 NaN 15.2 15.2
# Check the variables whose outlier ratio is above 5 (%)
Dataset %>% diagnose_outlier() %>% filter(outliers_ratio > 5) %>%
  mutate(rate = outliers_mean / with_mean) %>%
  arrange(desc(rate)) %>% dplyr::select(-outliers_cnt)
## # A tibble: 5 x 6
## variables outliers_ratio outliers_mean with_mean without_mean rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 YearsSinceLastPromo… 7.28 11.1 2.19 1.48 5.09
## 2 StockOptionLevel 5.78 3 0.794 0.658 3.78
## 3 YearsAtCompany 7.07 23.5 7.01 5.75 3.36
## 4 MonthlyIncome 7.76 18400. 6503. 5503. 2.83
## 5 TrainingTimesLastYe… 16.2 4.14 2.80 2.54 1.48
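To make the columns of this table concrete, the sketch below recomputes them by hand for one simulated variable. It assumes the usual boxplot rule (values more than 1.5 times the IQR beyond the hinges count as outliers), which is what base R's `boxplot.stats()` implements; the exact rule used by `diagnose_outlier()` should be confirmed in the dlookr documentation. The data are made up.

```r
# Made-up right-skewed variable (think "years since last promotion").
x <- c(rep(0:3, times = c(40, 25, 15, 10)), 7, 9, 11, 12, 15)

# Values beyond 1.5 * IQR from the hinges (the standard boxplot rule).
out <- boxplot.stats(x)$out

outliers_ratio <- 100 * length(out) / length(x)  # percent of observations
outliers_mean  <- mean(out)                      # mean of the outliers only
with_mean      <- mean(x)                        # mean including outliers
without_mean   <- mean(x[!x %in% out])           # mean excluding outliers
rate           <- outliers_mean / with_mean      # the "rate" column in the text

print(round(c(ratio = outliers_ratio, out_mean = outliers_mean,
              with = with_mean, without = without_mean, rate = rate), 2))
```

A large `rate` only says that the flagged values sit far above the bulk of the distribution; whether they are errors or legitimate long-tenure or high-pay cases is a domain judgment.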
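For YearsSinceLastPromotion, StockOptionLevel, YearsAtCompany, MonthlyIncome, and
TrainingTimesLastYear, the mean of the outliers appears to be larger than the overall mean.
When the ratio of the outlier mean to the overall mean (rate) is large, it is generally advisable to replace or remove the outliers.
Considering the actual work environment, however, tenure, stock option level, years since promotion, monthly pay, and training times
can plausibly contain outliers, and those outliers may in fact affect attrition.
We can look at the descriptive statistics of the variables that contain outliers to decide whether they should be removed, or inspect them with plots.
Dataset %>% dplyr::select(find_outliers(.)) %>% describe()
## # A tibble: 9 x 26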
## variable n na mean sd se_mean IQR skewness kurtosis p00
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MonthlyInco… 1470 0 6.50e+3 4.71e+3 1.23e+2 5468 1.37 1.01 1009
## 2 NumCompanie… 1470 0 2.69e+0 2.50e+0 6.52e-2 3 1.03 0.0102 0
## 3 StockOption… 1470 0 7.94e-1 8.52e-1 2.22e-2 1 0.969 0.365 0
## 4 TotalWorkin… 1470 0 1.13e+1 7.78e+0 2.03e-1 9 1.12 0.918 0
## 5 TrainingTim… 1470 0 2.80e+0 1.29e+0 3.36e-2 1 0.553 0.495 0
## 6 YearsAtComp… 1470 0 7.01e+0 6.13e+0 1.60e-1 6 1.76 3.94 0
## 7 YearsInCurr… 1470 0 4.23e+0 3.62e+0 9.45e-2 5 0.917 0.477 0
## 8 YearsSinceL… 1470 0 2.19e+0 3.22e+0 8.40e-2 3 1.98 3.61 0
## 9 YearsWithCu… 1470 0 4.12e+0 3.57e+0 9.31e-2 5 0.833 0.171 0
## # … with 16 more variables: p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## # p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## # p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
Dataset %>%
  plot_outlier(diagnose_outlier(Dataset) %>%
                 filter(outliers_ratio >= 0.5) %>%
                 dplyr::select(variables) %>%
                 unlist())
# With the dlookr package, the single line below produces a full report.
# Dataset %>% diagnose_web_report()
Dataset %>% dplyr::select(MonthlyIncome) %>% plot_box_numeric()
Dataset %>% dplyr::select(-MonthlyIncome) -> Dataset
# Extract only the numeric variables and compute multivariate outliers.
# The cutoff quantile is set to .99.
Dataset %>% purrr::keep(is.numeric) -> outcheck_num
outcheck_num %>% chemometrics::Moutlier(quantile=.99) -> Mout
# Attach the distances back to the original dataset
Dataset %>% mutate(md=Mout$md) -> Dataset
# Count the observations whose md exceeds the cutoff value (6.015885)
Dataset %>% filter(md>Mout$cutoff) %>% nrow
## [1] 68
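The idea behind this multivariate check can be sketched in base R with the classical Mahalanobis distance and a chi-square cutoff. This is an illustration on simulated data under the assumption that the cutoff is the square root of a chi-square quantile; `chemometrics::Moutlier()` also computes robust distances, which this sketch does not attempt to reproduce.

```r
set.seed(42)
n <- 200

# Simulated numeric data with two correlated columns (not the HR data).
X <- cbind(a = rnorm(n), b = rnorm(n))
X[, "b"] <- 0.6 * X[, "a"] + 0.8 * X[, "b"]
X[1, ] <- c(6, -6)   # plant one obvious multivariate outlier

# mahalanobis() returns squared distances, so take the square root.
md <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))

# Cutoff: square root of the 99% chi-square quantile with p degrees of freedom,
# where p is the number of numeric columns.
cutoff  <- sqrt(qchisq(0.99, df = ncol(X)))
flagged <- which(md > cutoff)

print(cutoff)
print(flagged)
```

A point can be unremarkable on each variable separately and still be flagged here, because the distance accounts for the correlation between columns.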
We check the diversity of the 68 observations above the cutoff.
# Original Dataset Attrition ratio
Dataset %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 1470 1233 83.9 1
## 2 Attrition Yes 1470 237 16.1 2
# Multivariate Outlier Dataset Attrition ratio
Dataset %>% filter(md>Mout$cutoff) %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 68 59 86.8 1
## 2 Attrition Yes 68 9 13.2 2
# 68 multivariate outlier observations were found; remove them along with the md column
Dataset %>% filter(md<Mout$cutoff) -> Dataset
Dataset %>% dplyr::select(-md) -> Dataset
# Save the cleaned dataset again
# Using the cleaned dataset, run exploratory analysis with the ExPanDaR package.
# Correlations, scatterplots, and more can be explored; it runs in a web browser.
# Dataset %>% ExPanD()
In actual exploratory analysis, packages such as ExPanDaR are used to examine the relationships among variables from various perspectives, as shown below.
This completes the preprocessing and EDA of the data.
# Data Import
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset %>% mutate_if(is.character, factor) -> Data_glm
# Setting Reference level
Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels
## [1] "Yes" "No"
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)
glm_train %>% nrow
## [1] 982
glm_test %>% nrow
## [1] 420
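The `strata = Attrition` argument keeps the class balance similar in both parts of the split. A minimal base R sketch of the same idea follows, on a simulated outcome with roughly the same imbalance (the vector `attrition` and the index names are illustrative assumptions; `initial_split()` itself may differ in details):

```r
set.seed(2727)

# Simulated outcome with about 16% "Yes", similar to the data in the text.
attrition <- factor(c(rep("Yes", 160), rep("No", 840)), levels = c("Yes", "No"))

# Sample 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(attrition), attrition)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)
test_idx <- setdiff(seq_along(attrition), train_idx)

# Both parts keep the original class proportions.
print(prop.table(table(attrition[train_idx])))
print(prop.table(table(attrition[test_idx])))
```

Without stratification, a random 70/30 split of a 16% minority class can leave the test set with noticeably fewer leavers, which makes recall estimates unstable.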
We use a recipe to check for multicollinearity, dummy-code the categorical variables, and normalize the numeric variables.
The printed recipe shows that no variable was removed because of multicollinearity,
and that dummy coding and normalization were applied as intended.
# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_predictors()) %>% prep() -> glm_recipe
glm_recipe
## Data Recipe
## Inputs:
## role #variables
## outcome 1
## predictor 29
## Training data contained 982 data points and no missing data.
## Operations:
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]
# make the training data set
glm_recipe %>% juice -> glm_train_re
# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re
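The split between `prep()`/`juice()` and `bake()` matters because the preprocessing constants must be learned from the training data only and then reused on the test data. The base R sketch below mimics that for a single normalization step on simulated data; the object names are hypothetical and the recipes package does considerably more than this.

```r
set.seed(1)

# Simulated training and test sets with one numeric column.
train <- data.frame(Age = rnorm(100, mean = 37, sd = 9))
test  <- data.frame(Age = rnorm(40,  mean = 37, sd = 9))

# "prep": learn the centering and scaling constants from the training data only.
age_mean <- mean(train$Age)
age_sd   <- sd(train$Age)

# "juice": the processed training data.
train_re <- data.frame(Age = (train$Age - age_mean) / age_sd)

# "bake": apply the same training-derived constants to new data.
test_re <- data.frame(Age = (test$Age - age_mean) / age_sd)

print(c(train_mean = mean(train_re$Age), train_sd = sd(train_re$Age)))
print(c(test_mean  = mean(test_re$Age),  test_sd  = sd(test_re$Age)))
```

The transformed test set will generally not have mean 0 and standard deviation 1 exactly, and that is expected: re-estimating the constants on the test set would leak information from it.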
# Model Setting
glm_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
glm_model
## Logistic Regression Model Specification (classification)
## Computational engine: glm
# Fitting Logistic Regression
glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)
# Examine the factors that affect Attrition.
tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)
## # A tibble: 18 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Age 1.42 0.155 2.25 2.47e- 2
## 2 DailyRate 1.26 0.116 1.96 4.97e- 2
## 3 DistanceFromHome 0.688 0.111 -3.36 7.69e- 4
## 4 NumCompaniesWorked 0.642 0.131 -3.37 7.57e- 4
## 5 YearsInCurrentRole 1.72 0.234 2.32 2.01e- 2
## 6 YearsSinceLastPromotion 0.535 0.168 -3.72 2.02e- 4
## 7 BusinessTravel_Travel_Frequently 0.134 0.535 -3.76 1.67e- 4
## 8 BusinessTravel_Travel_Rarely 0.369 0.489 -2.04 4.14e- 2
## 9 EnvironmentSatisfaction_Low 0.285 0.325 -3.86 1.11e- 4
## 10 Gender_Male 0.541 0.249 -2.47 1.34e- 2
## 11 JobInvolvement_Low 0.203 0.417 -3.82 1.33e- 4
## 12 JobRole_Laboratory.Technician 0.253 0.618 -2.22 2.62e- 2
## 13 JobSatisfaction_Very.High 2.25 0.311 2.60 9.24e- 3
## 14 MaritalStatus_Single 0.310 0.441 -2.66 7.85e- 3
## 15 OverTime_Yes 0.111 0.264 -8.33 7.77e-17
## 16 RelationshipSatisfaction_Low 0.438 0.325 -2.53 1.13e- 2
## 17 WorkLifeBalance_Better 5.11 0.440 3.71 2.07e- 4
## 18 WorkLifeBalance_Good 2.59 0.463 2.06 3.98e- 2
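When reading these odds ratios, the direction depends on the factor level order. In base R's `glm()` with a binomial family, the first factor level is treated as "failure" and the model describes the probability of the other level. With "Yes" set as the first level, as in the text, an estimate below 1 (for example OverTime_Yes) therefore corresponds to lower odds of "No", that is, higher odds of leaving. The simulated sketch below illustrates this, under the assumption that the `glm` engine passes the factor through to `stats::glm()` unchanged:

```r
set.seed(123)
n <- 2000

# Simulated truth: working overtime raises the probability of leaving.
overtime <- rbinom(n, 1, 0.3)
left     <- rbinom(n, 1, plogis(-2 + 1.5 * overtime))
label    <- ifelse(left == 1, "Yes", "No")

# "Yes" as the first level: glm() models P(Attrition == "No").
attr_yes_first <- factor(label, levels = c("Yes", "No"))
fit_yes_first  <- glm(attr_yes_first ~ overtime, family = binomial)

# "No" as the first level: glm() models P(Attrition == "Yes").
attr_no_first <- factor(label, levels = c("No", "Yes"))
fit_no_first  <- glm(attr_no_first ~ overtime, family = binomial)

or_yes_first <- exp(coef(fit_yes_first)[["overtime"]])
or_no_first  <- exp(coef(fit_no_first)[["overtime"]])

# The two odds ratios are reciprocals of each other.
print(c(yes_first = or_yes_first, no_first = or_no_first))
```

The predictions are unaffected by the level order; only the sign of the coefficients (and hence which side of 1 the odds ratio falls on) changes.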
# Model Prediction
pre_class <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class %>% head
## # A tibble: 6 x 1
## .pred_class
## <fct>
## 1 Yes
## 2 No
## 3 No
## 4 Yes
## 5 No
## 6 No
pre_prob <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob %>% head
## # A tibble: 6 x 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.674 0.326
## 2 0.333 0.667
## 3 0.0371 0.963
## 4 0.800 0.200
## 5 0.00743 0.993
## 6 0.0659 0.934
evaluation_tbl <- glm_test_re %>%
  dplyr::select(Attrition) %>% bind_cols(pre_class) %>%
  bind_cols(pre_prob)
evaluation_tbl
## # A tibble: 420 x 4
## Attrition .pred_class .pred_Yes .pred_No
## <fct> <fct> <dbl> <dbl>
## 1 Yes Yes 0.674 0.326
## 2 Yes No 0.333 0.667
## 3 No No 0.0371 0.963
## 4 Yes Yes 0.800 0.200
## 5 No No 0.00743 0.993
## 6 No No 0.0659 0.934
## 7 No No 0.000470 1.00
## 8 Yes No 0.214 0.786
## 9 No No 0.00000000245 1.00
## 10 Yes No 0.100 0.900
## # … with 410 more rows
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 33 14
## No 35 338
conf_mat(evaluation_tbl, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.883
## 2 kap binary 0.509
## 3 sens binary 0.485
## 4 spec binary 0.960
## 5 ppv binary 0.702
## 6 npv binary 0.906
## 7 mcc binary 0.521
## 8 j_index binary 0.446
## 9 bal_accuracy binary 0.723
## 10 detection_prevalence binary 0.112
## 11 precision binary 0.702
## 12 recall binary 0.485
## 13 f_meas binary 0.574
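The headline metrics in this summary follow directly from the four cells of the confusion matrix printed above. A short base R check, using those counts and treating "Yes" as the positive class:

```r
# Counts from the confusion matrix above (rows = prediction, columns = truth).
tp <- 33   # predicted Yes, truly Yes
fp <- 14   # predicted Yes, truly No
fn <- 35   # predicted No,  truly Yes
tn <- 338  # predicted No,  truly No

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # also called ppv
f_meas      <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sens = sensitivity, spec = specificity,
              precision = precision, f_meas = f_meas), 3))
```

The accuracy is high mainly because stayers dominate the test set; the sensitivity shows that roughly half of the actual leavers are missed at the default 0.5 threshold.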
roc_auc(evaluation_tbl, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.887
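Unlike the metrics in the confusion-matrix summary, the AUC does not depend on a classification threshold: it is the probability that a randomly chosen leaver receives a higher predicted probability than a randomly chosen stayer. The sketch below computes it two equivalent ways on a handful of made-up scores (not the model output above):

```r
# Hypothetical true labels and predicted probabilities of "Yes".
truth <- c("Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes")
p_yes <- c(0.67, 0.33, 0.04, 0.80, 0.01, 0.07, 0.40, 0.10)

pos <- p_yes[truth == "Yes"]
neg <- p_yes[truth == "No"]

# Pairwise definition: share of (positive, negative) pairs ranked correctly,
# counting ties as one half.
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Equivalent rank-sum (Mann-Whitney) form.
r <- rank(p_yes)
auc_rank <- (sum(r[truth == "Yes"]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairs = auc_pairs, rank = auc_rank))
```

Because it is threshold-free, the AUC can stay high even when the recall at the default 0.5 cutoff is modest.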
evaluation_tbl %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
glm_train %>% as.data.frame %>% SMOTE_NC('Attrition') -> glm_train_SMOTE
glm_train %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 982 822 83.7 1
## 2 Attrition Yes 982 160 16.3 2
glm_train_SMOTE %>% dplyr::select(Attrition) %>% diagnose_category()
## # A tibble: 2 x 6
## variables levels N freq ratio rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Attrition No 1644 822 50 1
## 2 Attrition Yes 1644 822 50 1
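SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones. The core idea for numeric features is to pick a minority row, pick one of its nearest minority neighbours, and interpolate between them. The sketch below shows only that numeric part on simulated data; `smote_numeric` is a hypothetical helper written for illustration, and SMOTE-NC as used in the text additionally handles nominal features.

```r
set.seed(7)

# Simulated minority-class rows with two numeric features.
minority <- cbind(x1 = rnorm(20), x2 = rnorm(20))

# Hypothetical helper: generate n_new synthetic rows by interpolating
# between a random minority row and one of its k nearest neighbours.
smote_numeric <- function(X, n_new, k = 5) {
  d <- as.matrix(dist(X))   # pairwise Euclidean distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(X), 1)            # a random minority row
    nbrs <- order(d[base, ])[2:(k + 1)]   # its k nearest neighbours (skip itself)
    nb   <- nbrs[sample.int(k, 1)]        # choose one neighbour
    u    <- runif(1)                      # interpolation weight in [0, 1]
    synth[i, ] <- X[base, ] + u * (X[nb, ] - X[base, ])
  }
  synth
}

synthetic <- smote_numeric(minority, n_new = 30)
print(dim(synthetic))
print(head(synthetic, 3))
```

Oversampling is applied to the training data only; the test set keeps its original class balance so that the evaluation reflects real conditions.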
# pre-processing by recipe
glm_train_SMOTE %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_predictors()) %>% prep() -> glm_recipe
glm_recipe
## Data Recipe
## Inputs:
## role #variables
## outcome 1
## predictor 29
## Training data contained 1644 data points and no missing data.
## Operations:
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]
# make the training data set
glm_recipe %>% juice -> glm_train_re
# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re
# Model Setting
glm_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
glm_model
## Logistic Regression Model Specification (classification)
## Computational engine: glm
# Fitting Logistic Regression
glm_fit <- glm_model %>% fit(Attrition ~., data=glm_train_re)
tidy(glm_fit, exponentiate=T) %>% filter(p.value<.05)
## # A tibble: 32 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 190. 0.974 5.39 0.0000000706
## 2 Age 1.31 0.0936 2.88 0.00403
## 3 DailyRate 1.42 0.0786 4.48 0.00000758
## 4 DistanceFromHome 0.677 0.0795 -4.91 0.000000930
## 5 NumCompaniesWorked 0.687 0.0852 -4.40 0.0000106
## 6 PercentSalaryHike 0.816 0.0957 -2.12 0.0336
## 7 TotalWorkingYears 1.42 0.136 2.57 0.0102
## 8 TrainingTimesLastYear 1.18 0.0759 2.21 0.0273
## 9 YearsInCurrentRole 1.56 0.137 3.24 0.00121
## 10 YearsSinceLastPromotion 0.576 0.103 -5.37 0.0000000800
## # … with 22 more rows
# Model Prediction
pre_class2 <- glm_fit %>% predict(new_data=glm_test_re, type="class")
pre_class2 %>% head
## # A tibble: 6 x 1
## .pred_class
## <fct>
## 1 Yes
## 2 Yes
## 3 No
## 4 Yes
## 5 No
## 6 No
pre_prob2 <- glm_fit %>% predict(new_data=glm_test_re, type="prob")
pre_prob2 %>% head
## # A tibble: 6 x 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.581 0.419
## 2 0.645 0.355
## 3 0.102 0.898
## 4 0.985 0.0154
## 5 0.00205 0.998
## 6 0.00592 0.994
evaluation_tbl2 <- glm_test_re %>%
  dplyr::select(Attrition) %>% bind_cols(pre_class2) %>%
  bind_cols(pre_prob2)
evaluation_tbl2
## # A tibble: 420 x 4
## Attrition .pred_class .pred_Yes .pred_No
## <fct> <fct> <dbl> <dbl>
## 1 Yes Yes 0.581 0.419
## 2 Yes Yes 0.645 0.355
## 3 No No 0.102 0.898
## 4 Yes Yes 0.985 0.0154
## 5 No No 0.00205 0.998
## 6 No No 0.00592 0.994
## 7 No No 0.00132 0.999
## 8 Yes No 0.372 0.628
## 9 No No 0.0000000108 1.00
## 10 Yes No 0.499 0.501
## # … with 410 more rows
# Evaluation
conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 46 65
## No 22 287
conf_mat(evaluation_tbl2, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.793
## 2 kap binary 0.392
## 3 sens binary 0.676
## 4 spec binary 0.815
## 5 ppv binary 0.414
## 6 npv binary 0.929
## 7 mcc binary 0.411
## 8 j_index binary 0.492
## 9 bal_accuracy binary 0.746
## 10 detection_prevalence binary 0.264
## 11 precision binary 0.414
## 12 recall binary 0.676
## 13 f_meas binary 0.514
roc_auc(evaluation_tbl2, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.830
evaluation_tbl2 %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
# Data Import
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset %>% mutate_if(is.character, factor) -> Data_glm
# Setting Reference level
Data_glm$Attrition <- relevel(Data_glm$Attrition, ref = "Yes")
Data_glm$Attrition %>% levels
## [1] "Yes" "No"
split <- initial_split(Data_glm, prop = .7, strata = Attrition)
glm_train <- training(split)
glm_test <- testing(split)
glm_train %>% nrow
## [1] 982
glm_test %>% nrow
## [1] 420
# pre-processing by recipe
glm_train %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_predictors()) %>% prep() -> glm_recipe
glm_recipe
## Data Recipe
## Inputs:
## role #variables
## outcome 1
## predictor 29
## Training data contained 982 data points and no missing data.
## Operations:
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed no terms [trained]
# make the training data set
glm_recipe %>% juice -> glm_train_re
# bake the test data set
glm_recipe %>% bake(glm_test) -> glm_test_re
# Model Improvement
glm(Attrition~., family = 'binomial', data=glm_train_re) %>%
  MASS::stepAIC(direction = "backward") -> step_glm
glm_fit_mod <- glm_model %>% fit(step_glm$formula, data=glm_train_re)
# Improved Model Prediction
pre_class_re <- glm_fit_mod %>% predict(new_data=glm_test_re, type="class")
pre_class_re %>% head
pre_prob_re <- glm_fit_mod %>% predict(new_data=glm_test_re, type="prob")
pre_prob_re %>% head
evaluation_tbl_mod <- glm_test_re %>%
  dplyr::select(Attrition) %>% bind_cols(pre_class_re) %>%
  bind_cols(pre_prob_re)
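Backward elimination starts from the full model and repeatedly drops the term whose removal lowers the AIC the most, stopping when no removal helps. The sketch below shows the same procedure with base R's `step()`, which works along the same lines as `MASS::stepAIC()`, on simulated data with one pure-noise predictor (all names and values are illustrative assumptions):

```r
set.seed(11)
n <- 500

# Simulated data: y depends on x1 and x2; "noise" is unrelated to y.
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)
y     <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))
dat   <- data.frame(y, x1, x2, noise)

# Full model, then backward elimination by AIC (trace = 0 silences the log).
full    <- glm(y ~ ., family = binomial, data = dat)
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

A lower AIC does not guarantee better test-set performance, which is consistent with the evaluation in the text, where the reduced model scores slightly below the full one on most metrics.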
# Improved Model Evaluation
conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 31 14
## No 37 338
conf_mat(evaluation_tbl_mod, truth = Attrition, estimate = .pred_class) %>% summary
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.879
## 2 kap binary 0.482
## 3 sens binary 0.456
## 4 spec binary 0.960
## 5 ppv binary 0.689
## 6 npv binary 0.901
## 7 mcc binary 0.496
## 8 j_index binary 0.416
## 9 bal_accuracy binary 0.708
## 10 detection_prevalence binary 0.107
## 11 precision binary 0.689
## 12 recall binary 0.456
## 13 f_meas binary 0.549
roc_auc(evaluation_tbl_mod, truth = Attrition, .pred_Yes)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.880
evaluation_tbl_mod %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
vip
## # A tibble: 26 x 3
## Variable Importance Sign
## <chr> <dbl> <chr>
## 1 OverTime_Yes 8.47 NEG
## 2 EnvironmentSatisfaction_Low 5.00 NEG
## 3 MaritalStatus_Single 4.62 NEG
## 4 BusinessTravel_Travel_Frequently 3.86 NEG
## 5 YearsSinceLastPromotion 3.77 NEG
## 6 NumCompaniesWorked 3.77 NEG
## 7 JobInvolvement_Low 3.71 NEG
## 8 EducationField_Life.Sciences 3.65 POS
## 9 WorkLifeBalance_Better 3.61 POS
## 10 EducationField_Medical 3.44 POS
## # … with 16 more rows
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset %>% mutate_if(is.character, factor) -> Data_rnd
# Setting Reference level
Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"
split <- initial_split(Data_rnd, prop = .7, strata = Attrition)
rnd_train <- training(split)
rnd_test <- testing(split)
rnd_train %>% as.data.frame %>% SMOTE_NC('Attrition') -> rnd_train_SMOTE
# pre-processing by recipe
rnd_train_SMOTE %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_predictors()) %>% prep() -> rnd_recipe
rnd_recipe
## Data Recipe
## Inputs:
## role #variables
## outcome 1
## predictor 29
## Training data contained 1644 data points and no missing data.
## Operations:
## Centering and scaling for Age, DailyRate, DistanceFromHome, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
## Correlation filter removed Department_Research...Development [trained]
# make the training data set
rnd_recipe %>% juice -> rnd_train_re
# bake the test data set
rnd_recipe %>% bake(rnd_test) -> rnd_test_re
# make validation set
data_fold <- vfold_cv(rnd_train_re)
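`vfold_cv()` uses 10 folds by default: each row is assigned to one fold, and every fold takes a turn as the assessment set while the other nine are used for fitting. A base R sketch of the fold assignment, using the 1644 training rows reported by the recipe (the variable names are illustrative assumptions, and rsample's exact assignment may differ):

```r
set.seed(99)
n <- 1644   # rows in the SMOTE training set reported by the recipe
v <- 10     # default number of folds in vfold_cv()

# Assign each row to one of v folds of (nearly) equal size, in random order.
fold_id <- sample(rep(seq_len(v), length.out = n))

# For the first resample: fit on nine folds, assess on the remaining one.
assessment <- which(fold_id == 1)
analysis   <- which(fold_id != 1)

print(table(fold_id))
print(c(analysis = length(analysis), assessment = length(assessment)))
```

One caveat worth keeping in mind: because the folds here are drawn from data that already contains SMOTE-generated rows, synthetic neighbours of the same original observation can land in both the analysis and assessment sets, so the cross-validated metrics may be optimistic compared with the untouched test set.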
# hyperparameter tuning: only mtry and min_n are tuned; my computer has 8 cores, so the number of threads for parallel processing is set to 6
tune_spec <- rand_forest(mtry=tune(), trees = 1000, min_n = tune()) %>%
  set_mode("classification") %>% set_engine('ranger', importance='impurity',seed=2727, num.threads=6)
tune_spec
## Random Forest Model Specification (classification)
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## Engine-Specific Arguments:
## importance = impurity
## seed = 2727
## num.threads = 6
## Computational engine: ranger
workflow() %>%
add_model(tune_spec) %>%
add_formula(Attrition ~ .)-> workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## Engine-Specific Arguments:
## importance = impurity
## seed = 2727
## num.threads = 6
## Computational engine: ranger
rnd_model <- workflow %>%
  # grid = 20 is assumed: it matches the 140 rows (20 candidates x 7 metrics) in the table below
  tune_grid(resamples = data_fold, grid = 20,
    control=control_grid(save_pred = TRUE),
    metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
# Graph for hyperparameter tuning
rnd_model %>% collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
rnd_model %>% collect_metrics()
## # A tibble: 140 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 46 18 accuracy binary 0.886 10 0.00767 Preprocessor1_Model01
## 2 46 18 f_meas binary 0.880 10 0.00862 Preprocessor1_Model01
## 3 46 18 precision binary 0.918 10 0.0131 Preprocessor1_Model01
## 4 46 18 recall binary 0.848 10 0.0160 Preprocessor1_Model01
## 5 46 18 roc_auc binary 0.956 10 0.00392 Preprocessor1_Model01
## 6 46 18 sens binary 0.848 10 0.0160 Preprocessor1_Model01
## 7 46 18 spec binary 0.926 10 0.0108 Preprocessor1_Model01
## 8 20 29 accuracy binary 0.894 10 0.00712 Preprocessor1_Model02
## 9 20 29 f_meas binary 0.888 10 0.00873 Preprocessor1_Model02
## 10 20 29 precision binary 0.928 10 0.0115 Preprocessor1_Model02
## # … with 130 more rows
# For logistic regression, only recall (how well leavers are predicted) was considered;
# from random forest onward, AUC is used, because it reflects both recall and specificity (which also covers stayers)
rnd_model %>% select_best(metric = 'roc_auc') -> param_best
tune_spec %>% finalize_model(param_best) -> rnd_best_model
# workflow update
workflow %>% finalize_workflow(param_best) -> workflow_final
workflow_final %>% last_fit(split, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas)) -> rnd_best_fit2
rnd_best_fit2 %>% collect_predictions() %>%
  conf_mat(truth = Attrition, estimate=.pred_class)
## Truth
## Prediction Yes No
## Yes 13 12
## No 55 340
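The metrics requested in `metric_set()` can be recomputed by hand from this matrix. A minimal base R sketch, assuming the counts printed above and `Yes` (the first factor level) as the event class; the object names are illustrative and not part of the original analysis.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth)
cm <- matrix(c(13, 55, 12, 340), nrow = 2,
             dimnames = list(Prediction = c("Yes", "No"),
                             Truth = c("Yes", "No")))

tp <- cm["Yes", "Yes"]  # leavers predicted as leavers
fn <- cm["No", "Yes"]   # leavers predicted as stayers
fp <- cm["Yes", "No"]   # stayers predicted as leavers
tn <- cm["No", "No"]    # stayers predicted as stayers

metrics <- c(
  sens      = tp / (tp + fn),
  spec      = tn / (tn + fp),
  precision = tp / (tp + fp),
  accuracy  = (tp + tn) / sum(cm)
)
round(metrics, 3)
```

With these counts, only 13 of the 68 actual leavers in the test set are flagged, so sensitivity is about 0.19 even though accuracy is about 0.84.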
rnd_best_fit2 %>% collect_predictions() %>% roc_curve(truth=Attrition, .pred_Yes) %>% autoplot()
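AUC summarizes the ROC curve in one number: it equals the probability that a randomly chosen leaver gets a higher predicted score than a randomly chosen stayer. A small base R sketch of that rank-based definition, using made-up scores rather than values from this data set; `auc_rank` is a hypothetical helper, not a tidymodels function.

```r
# Rank-based (Mann-Whitney) AUC: share of event/non-event pairs ranked correctly
auc_rank <- function(score, is_event) {
  r  <- rank(score)          # average ranks, so ties count as half
  n1 <- sum(is_event)        # number of events (leavers)
  n0 <- sum(!is_event)       # number of non-events (stayers)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical predicted probabilities of leaving and the true labels
score    <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1)
is_event <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

auc_toy <- auc_rank(score, is_event)
auc_toy
```

In this toy example 11 of the 12 leaver/stayer pairs are ordered correctly, which is why AUC rewards a model for separating both classes rather than for recall alone.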
deploy_randf <- fit(workflow_final, Data_glm)
pull_workflow_fit(deploy_randf)$fit %>% vip::vi()
## # A tibble: 29 x 2
## Variable Importance
## <chr> <dbl>
## 1 Age 28.4
## 2 OverTime 27.0
## 3 DailyRate 24.0
## 4 TotalWorkingYears 22.3
## 5 DistanceFromHome 22.1
## 6 HourlyRate 21.1
## 7 MonthlyRate 19.9
## 8 YearsAtCompany 14.1
## 9 NumCompaniesWorked 13.4
## 10 PercentSalaryHike 13.2
## # … with 19 more rows
# Next, the random forest is re-run using only the variables with an Importance value of 10 or more.
Dataset <- readRDS("Dataset_pre.RDS")
# split the Data set and set the reference level
Dataset %>% mutate_if(is.character, factor) -> Data_rnd
# Setting Reference level
Data_rnd$Attrition <- relevel(Data_rnd$Attrition, ref = "Yes")
Data_rnd$Attrition %>% levels
## [1] "Yes" "No"
# pre-processing by recipe
# note: the recipe is built from Dataset rather than Data_rnd, so Attrition keeps
# its default level order (No, Yes), as the confusion matrix further down shows
Dataset %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_predictors()) %>% prep() -> rnd_recipe_re
rnd_recipe_re %>% juice -> rnd_dataset
split_re <- initial_split(rnd_dataset, prop = .7, strata = Attrition)
rnd_train <- training(split_re)
rnd_test <- testing(split_re)
rnd_train %>% as.data.frame %>% SMOTE('Attrition') -> rnd_train_SMOTE
rnd_train_SMOTE %>% dplyr::select(Attrition, Age, DailyRate, DistanceFromHome, NumCompaniesWorked,
  TotalWorkingYears, TrainingTimesLastYear, YearsInCurrentRole,
  YearsSinceLastPromotion, BusinessTravel_Travel_Frequently,
  BusinessTravel_Travel_Rarely, EducationField_Life.Sciences,
  EducationField_Medical, EducationField_Other, EnvironmentSatisfaction_Low,
  Gender_Male, JobInvolvement_Low, JobInvolvement_Very.High,
  JobRole_Laboratory.Technician, JobRole_Research.Director,
  JobRole_Sales.Representative, JobSatisfaction_Low, JobSatisfaction_Very.High,
  MaritalStatus_Single, OverTime_Yes, RelationshipSatisfaction_Low,
  WorkLifeBalance_Better) -> rnd_train_re_sel
rnd_test %>% dplyr::select(Attrition, Age, DailyRate, DistanceFromHome, NumCompaniesWorked,
  TotalWorkingYears, TrainingTimesLastYear, YearsInCurrentRole,
  YearsSinceLastPromotion, BusinessTravel_Travel_Frequently,
  BusinessTravel_Travel_Rarely, EducationField_Life.Sciences,
  EducationField_Medical, EducationField_Other, EnvironmentSatisfaction_Low,
  Gender_Male, JobInvolvement_Low, JobInvolvement_Very.High,
  JobRole_Laboratory.Technician, JobRole_Research.Director,
  JobRole_Sales.Representative, JobSatisfaction_Low, JobSatisfaction_Very.High,
  MaritalStatus_Single, OverTime_Yes, RelationshipSatisfaction_Low,
  WorkLifeBalance_Better) -> rnd_test_re_sel
# Validate data again
# resampling object; the name data_fold is assumed, following the earlier block
data_fold <- vfold_cv(rnd_train_re_sel)
# Workflow setting
workflow() %>%
  add_model(tune_spec) %>%
  add_formula(step_glm$formula) -> workflow2
workflow2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## ─ Preprocessor ────────────────────────────────
## Attrition ~ Age + DailyRate + DistanceFromHome + NumCompaniesWorked +
## TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole +
## YearsSinceLastPromotion + BusinessTravel_Travel_Frequently +
## BusinessTravel_Travel_Rarely + EducationField_Life.Sciences +
## EducationField_Medical + EducationField_Other + EnvironmentSatisfaction_Low +
## Gender_Male + JobInvolvement_Low + JobInvolvement_Very.High +
## JobRole_Laboratory.Technician + JobRole_Research.Director +
## JobRole_Sales.Representative + JobSatisfaction_Low + JobSatisfaction_Very.High +
## MaritalStatus_Single + OverTime_Yes + RelationshipSatisfaction_Low +
## WorkLifeBalance_Better
## ─ Model ────────────────────────────────────
## Random Forest Model Specification (classification)
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## Engine-Specific Arguments:
## importance = impurity
## seed = 2727
## num.threads = 6
## Computational engine: ranger
# hyperparameter tune
rnd_model_mod <- workflow2 %>%
  # grid = 20 is assumed: it matches the 140 rows (20 candidates x 7 metrics) in the table below
  tune_grid(resamples = data_fold, grid = 20,
    control=control_grid(save_pred = TRUE),
    metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas))
# Graph for hyperparameter tuning
rnd_model_mod %>% collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")
rnd_model_mod %>% collect_metrics()
## # A tibble: 140 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 12 accuracy binary 0.857 10 0.0115 Preprocessor1_Model01
## 2 5 12 f_meas binary 0.921 10 0.00716 Preprocessor1_Model01
## 3 5 12 precision binary 0.857 10 0.0126 Preprocessor1_Model01
## 4 5 12 recall binary 0.995 10 0.00195 Preprocessor1_Model01
## 5 5 12 roc_auc binary 0.804 10 0.0183 Preprocessor1_Model01
## 6 5 12 sens binary 0.995 10 0.00195 Preprocessor1_Model01
## 7 5 12 spec binary 0.149 10 0.00974 Preprocessor1_Model01
## 8 5 37 accuracy binary 0.852 10 0.0132 Preprocessor1_Model02
## 9 5 37 f_meas binary 0.918 10 0.00799 Preprocessor1_Model02
## 10 5 37 precision binary 0.851 10 0.0136 Preprocessor1_Model02
## # … with 130 more rows
# For logistic regression, only recall (how well leavers are predicted) was considered;
# from random forest onward, AUC is used, because it reflects both recall and specificity (which also covers stayers)
rnd_model_mod %>% select_best(metric = 'roc_auc') -> param_best_mod
tune_spec %>% finalize_model(param_best_mod) -> rnd_best_model
# workflow update
workflow2 %>% finalize_workflow(param_best_mod) -> workflow_final2
workflow_final2 %>% last_fit(split_re, metrics=metric_set(roc_auc, sens, spec, recall, accuracy, precision, f_meas)) -> rnd_best_fit3
rnd_best_fit3 %>% collect_predictions() %>%
  conf_mat(truth = Attrition, estimate=.pred_class)
## Truth
## Prediction No Yes
## No 351 65
## Yes 1 3
rnd_best_fit3 %>% collect_metrics() %>% arrange(desc(.estimate))
## # A tibble: 7 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 sens binary 0.997 Preprocessor1_Model1
## 2 recall binary 0.997 Preprocessor1_Model1
## 3 f_meas binary 0.914 Preprocessor1_Model1
## 4 precision binary 0.844 Preprocessor1_Model1
## 5 accuracy binary 0.843 Preprocessor1_Model1
## 6 roc_auc binary 0.834 Preprocessor1_Model1
## 7 spec binary 0.0441 Preprocessor1_Model1
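These metrics appear to treat `No` as the event class, because the confusion matrix above lists `No` as the first level. A minimal base R sketch, using the counts printed above, that recomputes sensitivity for each choice of event class; `sens_for` and the other names are illustrative only.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth)
cm <- matrix(c(351, 1, 65, 3), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Truth = c("No", "Yes")))

# Sensitivity for a given event class:
# correctly predicted events divided by all true events
sens_for <- function(cm, event) cm[event, event] / sum(cm[, event])

sens_no  <- sens_for(cm, "No")   # stayers correctly predicted: 351 of 352
sens_yes <- sens_for(cm, "Yes")  # leavers correctly predicted: 3 of 68
round(c(No = sens_no, Yes = sens_yes), 3)
```

Read this way, the reported `sens` of 0.997 is the recall for stayers, while the reported `spec` of 0.0441 equals the recall for leavers (3 of 68). In yardstick the event class follows the factor level order unless `event_level` is set.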
# model deploy
deploy_randf <- fit(workflow_final, Data_glm)
deploy_randf
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## ─ Preprocessor ────────────────────────────────
## Attrition ~ .
## ─ Model ────────────────────────────────────
## Ranger result
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13L, x), num.trees = ~1000, min.node.size = min_rows(~5L, x), importance = ~"impurity", seed = ~2727, num.threads = ~6, verbose = FALSE, probability = TRUE)
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 1402
## Number of independent variables: 29
## Mtry: 13
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1090518
## Connection successful!
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 7 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version:
## H2O cluster version age: 3 months and 6 days
## H2O cluster name: H2O_started_from_R_raymondkim_eou682
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.20 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.5 (2021-03-31)
Dataset <- readRDS("Dataset_pre.RDS")
Dataset %>% mutate_if(is.character, factor) -> Data_auto
# Setting Reference level
Data_auto$Attrition <- relevel(Data_auto$Attrition, ref = "Yes")
Data_auto %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_numeric()) %>% prep() -> h2o_recipe
h2o_recipe %>% juice -> Dataset_h2o
# Putting the original dataframe into an h2o format
Dataset_h2o %>% as.h2o(destination_frame = "h2o_df") -> h2o_df
# Splitting into training, validation and testing sets
split_df <- h2o.splitFrame(h2o_df, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train <- h2o.assign(split_df[[1]], "train")
h2o_validation <- h2o.assign(split_df[[2]], "validation")
# the third split is the held-out test set
h2o_test <- h2o.assign(split_df[[3]], "test")
## Label Type Missing Zeros PosInf NegInf
## 1 Age real 0 0 0 0
## 2 DailyRate real 0 0 0 0
## 3 DistanceFromHome real 0 0 0 0
## 4 HourlyRate real 0 0 0 0
## 5 JobLevel real 0 0 0 0
## 6 MonthlyRate real 0 0 0 0
## 7 NumCompaniesWorked real 0 0 0 0
## 8 PercentSalaryHike real 0 0 0 0
## 9 StockOptionLevel real 0 0 0 0
## 10 TotalWorkingYears real 0 0 0 0
## 11 TrainingTimesLastYear real 0 0 0 0
## 12 YearsAtCompany real 0 0 0 0
## 13 YearsInCurrentRole real 0 0 0 0
## 14 YearsSinceLastPromotion real 0 0 0 0
## 15 YearsWithCurrManager real 0 0 0 0
## 16 Attrition enum 0 838 0 0
## 17 BusinessTravel_Travel_Frequently int 0 825 0 0
## 18 BusinessTravel_Travel_Rarely int 0 283 0 0
## 19 Department_Sales int 0 681 0 0
## 20 Education_below.College int 0 879 0 0
## 21 Education_College int 0 785 0 0
## 22 Education_Doctor int 0 958 0 0
## 23 Education_Master int 0 739 0 0
## 24 EducationField_Life.Sciences int 0 577 0 0
## 25 EducationField_Marketing int 0 889 0 0
## 26 EducationField_Medical int 0 678 0 0
## 27 EducationField_Other int 0 940 0 0
## 28 EducationField_Technical.Degree int 0 904 0 0
## 29 EnvironmentSatisfaction_Low int 0 801 0 0
## 30 EnvironmentSatisfaction_Medium int 0 802 0 0
## 31 EnvironmentSatisfaction_Very.High int 0 687 0 0
## 32 Gender_Male int 0 402 0 0
## 33 JobInvolvement_Low int 0 945 0 0
## 34 JobInvolvement_Medium int 0 736 0 0
## 35 JobInvolvement_Very.High int 0 893 0 0
## 36 JobRole_Human.Resources int 0 960 0 0
## 37 JobRole_Laboratory.Technician int 0 811 0 0
## 38 JobRole_Manager int 0 941 0 0
## 39 JobRole_Manufacturing.Director int 0 885 0 0
## 40 JobRole_Research.Director int 0 944 0 0
## 41 JobRole_Research.Scientist int 0 795 0 0
## 42 JobRole_Sales.Executive int 0 762 0 0
## 43 JobRole_Sales.Representative int 0 930 0 0
## 44 JobSatisfaction_Low int 0 809 0 0
## 45 JobSatisfaction_Medium int 0 793 0 0
## 46 JobSatisfaction_Very.High int 0 679 0 0
## 47 MaritalStatus_Married int 0 539 0 0
## 48 MaritalStatus_Single int 0 672 0 0
## 49 OverTime_Yes int 0 719 0 0
## 50 PerformanceRating_Outstanding int 0 847 0 0
## 51 RelationshipSatisfaction_Low int 0 799 0 0
## 52 RelationshipSatisfaction_Medium int 0 783 0 0
## 53 RelationshipSatisfaction_Very.High int 0 708 0 0
## 54 WorkLifeBalance_Best int 0 886 0 0
## 55 WorkLifeBalance_Better int 0 395 0 0
## 56 WorkLifeBalance_Good int 0 765 0 0
## Min Max Mean Sigma Cardinality
## 1 -2.0784303 2.685350 0.015732536 1.0072595 NA
## 2 -1.7519591 1.722043 -0.006184071 0.9943607 NA
## 3 -1.0113326 2.484078 0.001687752 1.0064210 NA
## 4 -1.7707175 1.676989 -0.013879977 0.9941769 NA
## 5 -0.9502098 2.905634 -0.009546925 0.9905219 NA
## 6 -1.7264474 1.799486 -0.006659469 1.0059270 NA
## 7 -1.0754126 2.542171 0.007798073 1.0042799 NA
## 8 -1.1608358 2.710195 -0.012502504 1.0029059 NA
## 9 -0.9287806 2.612880 -0.029990700 0.9763015 NA
## 10 -1.5239202 3.707345 -0.004939040 1.0006931 NA
## 11 -2.1847628 2.498781 -0.010428912 0.9911376 NA
## 12 -1.2771588 3.788408 -0.009389695 0.9921338 NA
## 13 -1.1788794 3.826003 -0.012229704 0.9933579 NA
## 14 -0.6928449 4.421544 -0.044992894 0.9661122 NA
## 15 -1.1627758 3.583973 -0.005964821 0.9948860 NA
## 16 0.0000000 1.000000 0.156092649 0.3631260 2
## 17 0.0000000 1.000000 0.169184290 0.3751035 NA
## 18 0.0000000 1.000000 0.715005035 0.4516395 NA
## 19 0.0000000 1.000000 0.314199396 0.4644301 NA
## 20 0.0000000 1.000000 0.114803625 0.3189454 NA
## 21 0.0000000 1.000000 0.209466264 0.4071327 NA
## 22 0.0000000 1.000000 0.035246727 0.1844957 NA
## 23 0.0000000 1.000000 0.255790534 0.4365245 NA
## 24 0.0000000 1.000000 0.418932528 0.4936329 NA
## 25 0.0000000 1.000000 0.104733132 0.3063635 NA
## 26 0.0000000 1.000000 0.317220544 0.4656286 NA
## 27 0.0000000 1.000000 0.053373615 0.2248907 NA
## 28 0.0000000 1.000000 0.089627392 0.2857911 NA
## 29 0.0000000 1.000000 0.193353474 0.3951267 NA
## 30 0.0000000 1.000000 0.192346425 0.3943423 NA
## 31 0.0000000 1.000000 0.308157100 0.4619645 NA
## 32 0.0000000 1.000000 0.595166163 0.4911072 NA
## 33 0.0000000 1.000000 0.048338369 0.2145883 NA
## 34 0.0000000 1.000000 0.258811682 0.4382027 NA
## 35 0.0000000 1.000000 0.100704935 0.3010893 NA
## 36 0.0000000 1.000000 0.033232628 0.1793338 NA
## 37 0.0000000 1.000000 0.183282981 0.3870933 NA
## 38 0.0000000 1.000000 0.052366566 0.2228774 NA
## 39 0.0000000 1.000000 0.108761329 0.3114964 NA
## 40 0.0000000 1.000000 0.049345418 0.2166973 NA
## 41 0.0000000 1.000000 0.199395770 0.3997474 NA
## 42 0.0000000 1.000000 0.232628399 0.4227202 NA
## 43 0.0000000 1.000000 0.063444109 0.2438829 NA
## 44 0.0000000 1.000000 0.185297080 0.3887342 NA
## 45 0.0000000 1.000000 0.201409869 0.4012556 NA
## 46 0.0000000 1.000000 0.316213494 0.4652316 NA
## 47 0.0000000 1.000000 0.457200403 0.4984159 NA
## 48 0.0000000 1.000000 0.323262840 0.4679578 NA
## 49 0.0000000 1.000000 0.275931521 0.4472077 NA
## 50 0.0000000 1.000000 0.147029204 0.3543135 NA
## 51 0.0000000 1.000000 0.195367573 0.3966832 NA
## 52 0.0000000 1.000000 0.211480363 0.4085640 NA
## 53 0.0000000 1.000000 0.287009063 0.4525938 NA
## 54 0.0000000 1.000000 0.107754280 0.3102261 NA
## 55 0.0000000 1.000000 0.602215509 0.4896871 NA
## 56 0.0000000 1.000000 0.229607251 0.4207922 NA
# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train), y)
auto_ml <- h2o.automl(
  y = y,
  x = x,
  training_frame = h2o_train,
  leaderboard_frame = h2o_validation,
  project_name = "Attrition",
  max_models = 10,
  seed = 1
)
best_models <- auto_ml@leaderboard
best_models %>% as.data.frame %>% DT::datatable()
# Retrieve the best models.
best_model_id <- as.data.frame(best_models$model_id)[,1]
stacked_ensemble_model <- h2o.getModel(grep("StackedEnsemble_BestOfFamily", best_model_id, value=TRUE)[1])
# metalearner of the stacked ensemble (object name assumed)
metalearner <- h2o.getModel(stacked_ensemble_model@model$metalearner$name)
# explainer <- lime(rnd_train,SEBOF)
# explain_top <- lime::explain(rnd_train[1:5],explainer, n_labels = 2, n_features = 10)
# plot_explanations(explain_top)
glm <- h2o.getModel(grep("GLM", best_model_id, value = TRUE)[1])
# top XGBoost model (object name assumed)
xgb <- h2o.getModel(grep("XGBoost", best_model_id, value = TRUE)[1])
# Examine the variable importance of the top GLM model
# A single model can show feature importance, as opposed to the stacked ensemble
h2o.varimp(glm) %>% DT::datatable()
# We can also plot the base learner contributions to the ensemble.
h2o.performance(auto_ml@leader, h2o_test)->performance_automl
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.444758160347686:
## No Yes Error Rate
## No 155 6 0.037267 =6/161
## Yes 12 23 0.342857 =12/35
## Totals 167 29 0.091837 =18/196
h2o.F1(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.6315789
h2o.accuracy(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.8928571
h2o.recall(performance_automl, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.50628906742409. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.5142857
## [1] 0.9110914
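As the warnings above suggest, an exact 0.5 cutoff can be applied directly to predicted probabilities instead of relying on the nearest stored threshold. A base R sketch with made-up probabilities and labels; in practice `p_yes` would be the class-probability column returned by `h2o.predict()` after conversion with `as.data.frame()`. All vectors below are hypothetical.

```r
# Hypothetical predicted probabilities of leaving and the true labels
p_yes <- c(0.91, 0.62, 0.51, 0.48, 0.30, 0.12, 0.55, 0.05)
truth <- factor(c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No"),
                levels = c("Yes", "No"))

# Apply the exact cutoff of 0.5
pred <- factor(ifelse(p_yes >= 0.5, "Yes", "No"), levels = c("Yes", "No"))
tab  <- table(Prediction = pred, Truth = truth)

recall    <- tab["Yes", "Yes"] / sum(tab[, "Yes"])   # TP / (TP + FN)
precision <- tab["Yes", "Yes"] / sum(tab["Yes", ])   # TP / (TP + FP)
f1        <- 2 * precision * recall / (precision + recall)
c(recall = recall, precision = precision, f1 = f1)
```

Changing the `0.5` in `ifelse()` shows how recall and precision trade off as the cutoff moves.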
plot(performance_automl, type="roc")
model_path <- h2o.saveModel(auto_ml@leader, path=getwd(), force=TRUE)
model_path
## [1] "/Users/raymondkim/Rproject/Turnover/StackedEnsemble_BestOfFamily_AutoML_20210826_123312"
\(Y = aX+b\)
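The point of predictive people analytics is to find changeable X variables that influence a Y we want to move in a positive direction.
The analyses so far, by contrast, put in every available X and simply try to predict as well as possible.
As discussed earlier, the outcomes we want to change through attrition prediction are the following.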
No | Y |
1 | Retention of key talent |
2 | Workforce planning |
3 | Pre-employment fit assessment |
4 | Training and development planning |

No | X | Intervention |
1 | Years in Current Role | Internal transfer |
2 | Years with Current Manager | Internal transfer, appointment of a new team leader |
3 | Over Time | Overtime limits, remote work |
4 | Work-Life Balance | Overtime limits, remote work |
5 | Environment Satisfaction | Improving the work environment |
6 | Distance From Home | Remote work, satellite offices |
7 | Business Travel | VR / video-conferencing systems |
8 | Department | Internal transfer |
9 | Training Times Last Year | Establishing and running a training program |
10 | Years Since Last Promotion | Promotion |
11 | Years with Current Manager | Internal transfer |
Dataset <- readRDS("Dataset_pre.RDS")
Dataset %>% colnames()
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EnvironmentSatisfaction" "Gender"
## [11] "HourlyRate" "JobInvolvement"
## [13] "JobLevel" "JobRole"
## [15] "JobSatisfaction" "MaritalStatus"
## [17] "MonthlyRate" "NumCompaniesWorked"
## [19] "OverTime" "PercentSalaryHike"
## [21] "PerformanceRating" "RelationshipSatisfaction"
## [23] "StockOptionLevel" "TotalWorkingYears"
## [25] "TrainingTimesLastYear" "WorkLifeBalance"
## [27] "YearsAtCompany" "YearsInCurrentRole"
## [29] "YearsSinceLastPromotion" "YearsWithCurrManager"
Dataset %>% dplyr::select(Attrition, YearsInCurrentRole, OverTime, WorkLifeBalance,
  EnvironmentSatisfaction, DistanceFromHome, Department, TrainingTimesLastYear,
  YearsSinceLastPromotion, YearsWithCurrManager) -> Dataset_HR
Dataset_HR %>% mutate_if(is.character, factor) -> Data_HR
# Setting Reference level
Data_HR$Attrition <- relevel(Data_HR$Attrition, ref = "Yes")
Data_HR %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_numeric()) %>% prep() -> h2o_recipe_re
h2o_recipe_re %>% juice -> Dataset_h2o_re
# Putting the original dataframe into an h2o format
Dataset_h2o_re %>% as.h2o(destination_frame = "h2o_df_re") -> h2o_df_re
# Splitting into training, validation and testing sets
split_df_re <- h2o.splitFrame(h2o_df_re, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train_re <- h2o.assign(split_df_re[[1]], "train")
h2o_validation_re <- h2o.assign(split_df_re[[2]], "validation")
# the third split is the held-out test set
h2o_test_re <- h2o.assign(split_df_re[[3]], "test")
## Label Type Missing Zeros PosInf NegInf
## 1 YearsInCurrentRole real 0 0 0 0
## 2 DistanceFromHome real 0 0 0 0
## 3 TrainingTimesLastYear real 0 0 0 0
## 4 YearsSinceLastPromotion real 0 0 0 0
## 5 YearsWithCurrManager real 0 0 0 0
## 6 Attrition enum 0 838 0 0
## 7 OverTime_Yes int 0 719 0 0
## 8 WorkLifeBalance_Best int 0 886 0 0
## 9 WorkLifeBalance_Better int 0 395 0 0
## 10 WorkLifeBalance_Good int 0 765 0 0
## 11 EnvironmentSatisfaction_Low int 0 801 0 0
## 12 EnvironmentSatisfaction_Medium int 0 802 0 0
## 13 EnvironmentSatisfaction_Very.High int 0 687 0 0
## 14 Department_Research...Development int 0 349 0 0
## Min Max Mean Sigma Cardinality
## 1 -1.1788794 3.826003 -0.012229704 0.9933579 NA
## 2 -1.0113326 2.484078 0.001687752 1.0064210 NA
## 3 -2.1847628 2.498781 -0.010428912 0.9911376 NA
## 4 -0.6928449 4.421544 -0.044992894 0.9661122 NA
## 5 -1.1627758 3.583973 -0.005964821 0.9948860 NA
## 6 0.0000000 1.000000 0.156092649 0.3631260 2
## 7 0.0000000 1.000000 0.275931521 0.4472077 NA
## 8 0.0000000 1.000000 0.107754280 0.3102261 NA
## 9 0.0000000 1.000000 0.602215509 0.4896871 NA
## 10 0.0000000 1.000000 0.229607251 0.4207922 NA
## 11 0.0000000 1.000000 0.193353474 0.3951267 NA
## 12 0.0000000 1.000000 0.192346425 0.3943423 NA
## 13 0.0000000 1.000000 0.308157100 0.4619645 NA
## 14 0.0000000 1.000000 0.648539778 0.4776669 NA
# Establish X and Y (Features and Labels)
y1 <- "Attrition"
x1 <- setdiff(names(h2o_train_re), y1)
automl <- h2o.automl(
  y = y1,
  x = x1,
  training_frame = h2o_train_re,
  validation_frame = h2o_validation_re,
  project_name = "Attrition",
  max_models = 10,
  seed = 1
)
# Best models
best_models <- automl@leaderboard
best_models %>% as.data.frame %>% DT::datatable()
h2o.performance(automl@leader, h2o_test_re)->performance_automl_re
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.261791635152933:
## No Yes Error Rate
## No 146 15 0.093168 =15/161
## Yes 11 24 0.314286 =11/35
## Totals 157 39 0.132653 =26/196
h2o.F1(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.2
h2o.accuracy(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.8367347
h2o.recall(performance_automl_re, thresholds = .5)
## Warning in h2o.find_row_by_threshold(o, t): Could not find exact threshold: 0.5
## for this set of metrics; using closest threshold found: 0.522384473931713. Run
## `h2o.predict` and apply your desired threshold on a probability column.
## [[1]]
## [1] 0.1142857
## [1] 0.8436557
plot(performance_automl_re, type="roc")
# Retrieve the best model.
best_model_id2 <- as.data.frame(best_models$model_id)[,1]
glm_re <- h2o.getModel(grep("GLM", best_model_id2, value = TRUE)[1])
h2o.varimp(glm_re) %>% DT::datatable()
Dataset <- readRDS("Dataset_pre.RDS")
Dataset %>% dplyr::select(PerformanceRating) %>% unique
## # A tibble: 2 x 1
## PerformanceRating
## <chr>
## 1 Excellent
## 2 Outstanding
Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% nrow
## [1] 212
Dataset %>% filter(PerformanceRating %in% "Outstanding") %>% dplyr::select(Attrition) %>% table
## .
## No Yes
## 175 37
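The attrition share among `Outstanding` performers follows directly from these counts. A one-step check in base R, with the counts copied from the table above; `counts` and `shares` are illustrative names.

```r
# Attrition counts among Outstanding performers, copied from the table above
counts <- c(No = 175, Yes = 37)

# Convert counts to proportions of the 212 employees
shares <- prop.table(counts)
round(shares, 3)
```

About 17.5% of the 212 `Outstanding` performers left.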
# The class mix (diversity) is composed similarly
Dataset %>% filter(PerformanceRating %in% "Outstanding") -> Dataset_High
Dataset_High %>% mutate_if(is.character, factor) -> Data_High
# Setting Reference level
Data_High$Attrition <- relevel(Data_High$Attrition, ref = "Yes")
Data_High %>% dplyr::select(-PerformanceRating) %>%
  dplyr::select(Attrition, YearsInCurrentRole, OverTime, WorkLifeBalance,
    EnvironmentSatisfaction, DistanceFromHome, Department, TrainingTimesLastYear,
    YearsSinceLastPromotion, YearsWithCurrManager) -> Data_High
Data_High %>% recipe(Attrition~.) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_numeric()) %>% prep() -> h2o_recipe_High
h2o_recipe_High %>% juice -> Dataset_h2o_High
# Putting the original dataframe into an h2o format
Dataset_h2o_High %>% as.h2o(destination_frame = "h2o_df") -> h2o_df_High
# Splitting into training, validation and testing sets
split_df_High <- h2o.splitFrame(h2o_df_High, c(0.7, 0.15), seed=12)
# Obtaining our three types of sets into three separate values
h2o_train_High <- h2o.assign(split_df_High[[1]], "train")
h2o_validation_High <- h2o.assign(split_df_High[[2]], "validation")
# the third split is the held-out test set
h2o_test_High <- h2o.assign(split_df_High[[3]], "test")
## Label Type Missing Zeros PosInf NegInf
## 1 YearsInCurrentRole real 0 0 0 0
## 2 DistanceFromHome real 0 0 0 0
## 3 TrainingTimesLastYear real 0 0 0 0
## 4 YearsSinceLastPromotion real 0 0 0 0
## 5 YearsWithCurrManager real 0 0 0 0
## 6 Attrition enum 0 135 0 0
## 7 OverTime_Yes int 0 110 0 0
## 8 WorkLifeBalance_Best int 0 140 0 0
## 9 WorkLifeBalance_Better int 0 62 0 0
## 10 WorkLifeBalance_Good int 0 124 0 0
## 11 EnvironmentSatisfaction_Low int 0 129 0 0
## 12 EnvironmentSatisfaction_Medium int 0 127 0 0
## 13 EnvironmentSatisfaction_Very.High int 0 114 0 0
## 14 Department_Research...Development int 0 48 0 0
## 15 Department_Sales int 0 116 0 0
## Min Max Mean Sigma Cardinality
## 1 -1.1954805 3.396897 -0.01832029 0.9943792 NA
## 2 -1.0107498 2.250017 -0.03340425 0.9781307 NA
## 3 -2.1589112 2.583982 -0.02761118 1.0003229 NA
## 4 -0.6732738 3.673956 0.01916470 0.9943281 NA
## 5 -1.2019820 3.242556 -0.01114584 0.9985143 NA
## 6 0.0000000 1.000000 0.14556962 0.3537956 2
## 7 0.0000000 1.000000 0.30379747 0.4613586 NA
## 8 0.0000000 1.000000 0.11392405 0.3187292 NA
## 9 0.0000000 1.000000 0.60759494 0.4898387 NA
## 10 0.0000000 1.000000 0.21518987 0.4122607 NA
## 11 0.0000000 1.000000 0.18354430 0.3883430 NA
## 12 0.0000000 1.000000 0.19620253 0.3983862 NA
## 13 0.0000000 1.000000 0.27848101 0.4496767 NA
## 14 0.0000000 1.000000 0.69620253 0.4613586 NA
## 15 0.0000000 1.000000 0.26582278 0.4431750 NA
# Establish X and Y (Features and Labels)
y <- "Attrition"
x <- setdiff(names(h2o_train_High), y)
automl_high <- h2o.automl(
  y = y,
  x = x,
  training_frame = h2o_train_High,
  validation_frame = h2o_validation_High,
  project_name = "Attrition",
  max_models = 10,
  seed = 1
)
# Best models
best_models_High <- automl_high@leaderboard
best_models_High %>% as.data.frame %>% DT::datatable()
h2o.performance(automl_high@leader, h2o_test_High)->performance_automl_High
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.479166666666667:
## No Yes Error Rate
## No 20 0 0.000000 =0/20
## Yes 0 7 0.000000 =0/7
## Totals 20 7 0.000000 =0/27
## threshold f1
## 1 0.873958333 0.2500000
## 2 0.770833333 0.4444444
## 3 0.729166667 0.6000000
## 4 0.645833333 0.7272727
## 5 0.583333333 0.8333333
## 6 0.500000000 0.9230769
## 7 0.479166667 1.0000000
## 8 0.423611111 0.9333333
## 9 0.187500000 0.8750000
## 10 0.135416667 0.7777778
## 11 0.130555555 0.7368421
## 12 0.114583333 0.7000000
## 13 0.062500000 0.6666667
## 14 0.041666667 0.6086957
## 15 0.028472222 0.5833333
## 16 0.020833333 0.4827586
## 17 0.010416667 0.4666667
## 18 0.006944444 0.4516129
## 19 0.000000000 0.4117647
## threshold accuracy
## 1 0.873958333 0.7777778
## 2 0.770833333 0.8148148
## 3 0.729166667 0.8518519
## 4 0.645833333 0.8888889
## 5 0.583333333 0.9259259
## 6 0.500000000 0.9629630
## 7 0.479166667 1.0000000
## 8 0.423611111 0.9629630
## 9 0.187500000 0.9259259
## 10 0.135416667 0.8518519
## 11 0.130555555 0.8148148
## 12 0.114583333 0.7777778
## 13 0.062500000 0.7407407
## 14 0.041666667 0.6666667
## 15 0.028472222 0.6296296
## 16 0.020833333 0.4444444
## 17 0.010416667 0.4074074
## 18 0.006944444 0.3703704
## 19 0.000000000 0.2592593
## threshold tpr
## 1 0.873958333 0.1428571
## 2 0.770833333 0.2857143
## 3 0.729166667 0.4285714
## 4 0.645833333 0.5714286
## 5 0.583333333 0.7142857
## 6 0.500000000 0.8571429
## 7 0.479166667 1.0000000
## 8 0.423611111 1.0000000
## 9 0.187500000 1.0000000
## 10 0.135416667 1.0000000
## 11 0.130555555 1.0000000
## 12 0.114583333 1.0000000
## 13 0.062500000 1.0000000
## 14 0.041666667 1.0000000
## 15 0.028472222 1.0000000
## 16 0.020833333 1.0000000
## 17 0.010416667 1.0000000
## 18 0.006944444 1.0000000
## 19 0.000000000 1.0000000
## threshold precision
## 1 0.873958333 1.0000000
## 2 0.770833333 1.0000000
## 3 0.729166667 1.0000000
## 4 0.645833333 1.0000000
## 5 0.583333333 1.0000000
## 6 0.500000000 1.0000000
## 7 0.479166667 1.0000000
## 8 0.423611111 0.8750000
## 9 0.187500000 0.7777778
## 10 0.135416667 0.6363636
## 11 0.130555555 0.5833333
## 12 0.114583333 0.5384615
## 13 0.062500000 0.5000000
## 14 0.041666667 0.4375000
## 15 0.028472222 0.4117647
## 16 0.020833333 0.3181818
## 17 0.010416667 0.3043478
## 18 0.006944444 0.2916667
## 19 0.000000000 0.2592593
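F1 is the harmonic mean of precision and recall, so the rows of the tables above can be cross-checked against each other. A base R sketch using the values printed for thresholds 0.5 and 0.4236; `f1_score` is an illustrative helper, not an h2o function.

```r
# F1 as the harmonic mean of precision and recall (tpr)
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Threshold 0.5: precision 1.0, tpr 0.8571429 (6 of 7 leavers found)
f1_at_050 <- f1_score(precision = 1, recall = 6 / 7)

# Threshold 0.4236: precision 0.875, tpr 1.0
f1_at_042 <- f1_score(precision = 0.875, recall = 1)

round(c(f1_at_050, f1_at_042), 7)
```

Both values agree with the `f1` column above. Since this test set has only 27 rows, 7 of them leavers, one prediction moves each metric by several points, so the perfect score at threshold 0.479 rests on very few cases.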
## [1] 1
plot(performance_automl_High, type="roc")
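# Retrieve the best model.
best_model_id_High <- as.data.frame(best_models_High$model_id)[,1]
DRF_re <- h2o.getModel(grep("DRF", best_model_id_High, value = TRUE)[1])
h2o.varimp(DRF_re) %>% DT::datatable()
Splitting the analysis at the group level shows that a different set of variables emerges.
The results obtained this way can now be interpreted in light of the organization's situation and context, and matching interventions can be planned and carried out!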
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?
