Using the IBM HR Analytics Employee Attrition & Performance dataset, I built interpretable machine learning models to understand why employees leave and to help HR teams identify at-risk staff. This project combines EDA, class imbalance handling, and predictive modeling with a focus on interpretability.
Problem
Employee turnover is costly, and HR teams need tools to identify risk factors and predict attrition.
The question: Which factors most strongly influence attrition, and how can we build a model that balances accuracy with interpretability?
Approach
- EDA – Explored distributions of satisfaction, overtime, department, tenure, and job level with respect to attrition.
- Preprocessing –
  - One-hot encoding for categorical features
  - SMOTE to handle class imbalance
  - Binning (Age, YearsAtCompany)
  - Rare label encoding for JobRole
- Modeling –
  - Logistic Regression (baseline)
  - Random Forest
  - XGBoost
  - Hyperparameter tuning with RandomizedSearchCV
- Evaluation –
  - Confusion Matrix, ROC Curve, AUC
  - Classification Report (Precision, Recall, F1)
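The preprocessing steps listed above (binning, rare label encoding, one-hot encoding) can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the project's actual code: the column names come from the dataset, but the values, the bin edges, and the `min_share` threshold are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the HR data: real column names, made-up values.
df = pd.DataFrame({
    "Age": [24, 31, 38, 45, 52, 29, 41, 36],
    "YearsAtCompany": [1, 3, 7, 12, 20, 2, 9, 5],
    "JobRole": ["Sales Executive", "Research Scientist", "Sales Executive",
                "Manager", "Research Scientist", "Sales Executive",
                "Human Resources", "Research Scientist"],
    "OverTime": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})

# Binning: replace raw numbers with ordered buckets (bin edges are assumptions).
df["AgeBand"] = pd.cut(df["Age"], bins=[17, 30, 40, 50, 60],
                       labels=["18-30", "31-40", "41-50", "51-60"])
df["TenureBand"] = pd.cut(df["YearsAtCompany"], bins=[-1, 2, 5, 10, 40],
                          labels=["0-2", "3-5", "6-10", "11+"])

# Rare label encoding: fold roles below a frequency threshold into "Other".
min_share = 0.2  # assumed threshold, not taken from the project
role_share = df["JobRole"].value_counts(normalize=True)
rare_roles = role_share[role_share < min_share].index
df["JobRole"] = df["JobRole"].where(~df["JobRole"].isin(rare_roles), "Other")

# One-hot encoding: one indicator column per category.
categorical = ["AgeBand", "TenureBand", "JobRole", "OverTime"]
X = pd.get_dummies(df[categorical], columns=categorical)
print(X.columns.tolist())
```

In this toy frame, Manager and Human Resources each appear once (12.5% of rows), so both are folded into "Other" before encoding. Grouping rare roles this way keeps the one-hot matrix from gaining near-empty columns.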
Results
| Model | Accuracy | Recall (Attrition) | AUC |
|---|---|---|---|
| Logistic Regression | 0.86 | 0.26 | 0.65 |
| Random Forest (SMOTE) | 0.82 | 0.32 | 0.74 |
| XGBoost (SMOTE) | 0.82 | 0.40 | 0.78 |
| Tuned XGBoost (SMOTE) | 0.81 | 0.34 | 0.77 |
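- Best-performing model: untuned XGBoost with SMOTE, which had the highest attrition recall (0.40) and AUC (0.78). The logistic regression baseline had the highest accuracy (0.86) but the lowest recall (0.26), and tuning XGBoost did not improve on the untuned model.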
- Most predictive features: OverTime, StockOptionLevel, JobSatisfaction, YearsAtCompany
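The gap between accuracy and recall in the table is typical of imbalanced data, and a small worked example may make it concrete. The confusion-matrix counts below are hypothetical (a 294-row test split with 47 leavers, chosen only to land near the logistic regression row); they are not the project's actual confusion matrix.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Share of actual leavers the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted leavers who actually left."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 47 employees left, 247 stayed.
model = dict(tp=12, fp=6, fn=35, tn=241)
always_stay = dict(tp=0, fp=0, fn=47, tn=247)  # predicts "stay" for everyone

for name, cm in [("model", model), ("always 'stay'", always_stay)]:
    acc = accuracy(**cm)
    rec = recall(cm["tp"], cm["fn"])
    print(f"{name}: accuracy={acc:.2f} recall={rec:.2f}")
# model: accuracy=0.86 recall=0.26
# always 'stay': accuracy=0.84 recall=0.00
```

A classifier that never flags anyone already scores 0.84 accuracy on these counts, which is why recall on the attrition class and AUC are the more informative columns when comparing the models.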
Impact
- Addressed class imbalance using SMOTE: the SMOTE-trained models in the results table reached attrition recall of 0.32 to 0.40, compared with 0.26 for the logistic regression baseline.
- Showed that behavioral features (e.g., job satisfaction, overtime, work-life balance) are critical for predicting attrition.
- Delivered an interpretable workflow that HR teams could apply to support retention strategies.
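For readers unfamiliar with SMOTE, the idea behind it can be sketched in a few lines of NumPy: each synthetic minority row is placed at a random point on the segment between a real minority row and one of its nearest minority neighbours. The function below is a simplified illustration of that interpolation step only, not imbalanced-learn's implementation; the function name, the 84/16 toy class split, and the two-feature data are assumptions made for the example.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other rows

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy data: 84 "stay" rows and 16 "leave" rows with two numeric features.
rng = np.random.default_rng(42)
X_stay = rng.normal(0.0, 1.0, size=(84, 2))
X_leave = rng.normal(2.0, 1.0, size=(16, 2))

X_new = smote_like_oversample(X_leave, n_new=len(X_stay) - len(X_leave))
X_leave_balanced = np.vstack([X_leave, X_new])
print(X_leave_balanced.shape)
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region the minority class already occupies. Resampling of this kind is normally applied to the training split only, so that test metrics still reflect the real class balance.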
Skills & Tools
- Python (pandas, NumPy, scikit-learn, XGBoost, imbalanced-learn)
- EDA and feature engineering
- Handling class imbalance (SMOTE)
- Model evaluation (AUC, recall, precision, F1)
- Hyperparameter tuning (RandomizedSearchCV)
- Visualization (matplotlib, seaborn)
