A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

Yan Wang, Xuelei Sherry Ni
DOI: https://doi.org/10.48550/arXiv.1901.08433
2019-01-24
Abstract: This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods, namely weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information, are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of different FS and hyper-parameter optimization methods on model performance are investigated using the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to that of the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR, while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization is superior to RS, since it results in a significantly higher accuracy and a marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances the model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative yet powerful approach for business risk modeling.
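The abstract compares methods with the Wilcoxon Signed Rank Test on scores from 10-fold cross validation. The following is a minimal sketch of how such a paired comparison could be computed, not the authors' code. The function names `signed_rank_statistic` and `exact_two_sided_p` are hypothetical, and the per-fold counts are invented purely for illustration; they are not results from the paper. Integer counts (correct predictions per fold) are used so that tied differences compare exactly.

```python
from itertools import product


def signed_rank_statistic(a, b):
    """Return (W_plus, W_minus, ranks) for paired samples a and b."""
    # Pairs with a zero difference carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(w_plus, w_minus, ranks):
    """Exact two-sided p-value by enumerating every sign assignment."""
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= observed:
            hits += 1
        count += 1
    return hits / count


# Made-up correct-prediction counts per fold (out of 100), for illustration only.
method_a = [86, 84, 88, 85, 87, 83, 89, 86, 85, 88]
method_b = [84, 85, 85, 85, 84, 82, 85, 83, 84, 86]

w_plus, w_minus, ranks = signed_rank_statistic(method_a, method_b)
p_value = exact_two_sided_p(w_plus, w_minus, ranks)
print(w_plus, w_minus, p_value)
```

Exhaustive enumeration is only practical for a small number of folds (ten folds give at most 1,024 sign assignments); for larger samples a library routine such as `scipy.stats.wilcoxon` would be the usual choice.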
Machine Learning
What problem does this paper attempt to address?
This paper aims to improve the performance of the extreme gradient boosting (XGBoost) model in business risk classification through feature selection (FS) algorithms and hyper-parameter optimization methods. Specifically, the researchers explore the following questions:

1. **The impact of different feature selection methods on the performance of the logistic regression (LR) and XGBoost models**: By comparing the effect of five commonly used feature selection methods (Gini index, chi-square test, hierarchical clustering, correlation weight, and information gain ratio) on model performance, the paper identifies the best feature selection method for each model.
2. **The impact of hyper-parameter optimization methods on the performance of the XGBoost model**: The paper studies two hyper-parameter optimization methods, random search (RS) and the Bayesian tree-structured Parzen estimator (TPE), and determines which one yields better XGBoost performance.
3. **Whether XGBoost is more powerful than the traditionally used LR model in business risk prediction**: The paper compares XGBoost and LR in terms of classification accuracy, AUC, recall, and F1 score.
4. **Identification of important features in the data set used**: Ranking features by importance enhances the interpretability of the model and identifies the features that are crucial for risk prediction.

In summary, the paper explores an efficient and comprehensive method for constructing business risk models through systematic experimental design and analysis, in particular by combining XGBoost with effective feature selection and hyper-parameter optimization to improve the model's predictive ability.
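To make the random search (RS) baseline concrete, here is a minimal sketch under stated assumptions. The search space below is hypothetical (the paper's actual ranges are not given in this summary), although `max_depth`, `learning_rate`, and `subsample` are real XGBoost parameter names. The objective `toy_cv_score` is a made-up stand-in for a cross-validated score, not a trained model.

```python
import math
import random

# Hypothetical search space; the ranges are assumptions for illustration.
SEARCH_SPACE = {
    "max_depth": ("int", 2, 10),
    "learning_rate": ("loguniform", 0.01, 0.3),
    "subsample": ("uniform", 0.5, 1.0),
}


def sample_params(rng):
    """Draw one configuration, sampling each dimension independently."""
    params = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        elif kind == "loguniform":
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            params[name] = rng.uniform(low, high)
    return params


def toy_cv_score(params):
    """Stand-in for a 10-fold cross-validated score (NOT a real model)."""
    return (
        1.0
        - 0.01 * abs(params["max_depth"] - 6)
        - 0.5 * abs(params["learning_rate"] - 0.1)
        - 0.1 * abs(params["subsample"] - 0.8)
    )


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    history = []
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        history.append(score)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


best_params, best_score, history = random_search(toy_cv_score, n_trials=50)
print(best_params, round(best_score, 4))
```

Random search draws every trial independently of earlier results. TPE, by contrast, uses the scores of past trials to decide where to sample next. In practice the toy objective would be replaced by a function that trains XGBoost with the sampled parameters and returns its cross-validated metric.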