A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

Yan Wang, Xuelei Sherry Ni

DOI: https://doi.org/10.48550/arXiv.1901.08433

2019-01-24

Abstract: This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods, namely weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information, are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of different FS and hyper-parameter optimization methods on model performance are investigated by the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows superiority over RS since it results in a significantly higher accuracy and a marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative yet powerful approach for business risk modeling.
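The four evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it computes accuracy, recall, and F1 from binary predictions, and AUC as the fraction of positive/negative pairs that the scores order correctly. The labels, scores, and the 0.5 decision threshold are hypothetical toy values chosen only for illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 marks the risky class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute accuracy, recall, and F1 from hard 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, recall, f1


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and predicted risk scores for 8 cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, rec, f1 = accuracy_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} recall={rec:.4f} f1={f1:.4f} auc={auc(y_true, scores):.4f}")
```

The pairwise AUC is quadratic in the number of examples, which is fine for a small illustration; a library routine would be the usual choice on a real data set.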

Machine Learning

What problem does this paper attempt to address?

This paper aims to improve the performance of the extreme gradient boosting (XGBoost) model in business risk classification through feature selection (FS) algorithms and hyper-parameter optimization. Specifically, the researchers explore the following questions:

1. **The impact of different feature selection methods on the performance of logistic regression (LR) and XGBoost models**: Five commonly used FS methods (weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information) are compared to find the best-performing method for each model.
2. **The impact of hyper-parameter optimization methods on the performance of the XGBoost model**: Two hyper-parameter optimization methods, random search (RS) and the Bayesian tree-structured Parzen Estimator (TPE), are compared to determine which yields better XGBoost performance.
3. **Whether XGBoost is more powerful than the traditionally used LR model in business risk prediction**: XGBoost and LR are compared in terms of classification accuracy, AUC, recall, and F1 score.
4. **Identification of important features in the data set used**: Features are ranked by importance to improve model interpretability and to identify those that matter most for risk prediction.

In summary, the paper explores an efficient and comprehensive method for constructing business risk models through systematic experimental design and analysis, in particular by combining XGBoost with effective feature selection and hyper-parameter optimization to improve predictive ability.
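The paired comparisons described above (for example TPE versus RS across cross-validation folds) rely on the Wilcoxon Signed Rank Test. The following is a minimal pure-Python sketch of that test under textbook conventions (zero differences dropped, tied magnitudes given average ranks, exact two-sided p-value by enumerating sign patterns). It is not the authors' code, and the per-fold accuracies are hypothetical numbers, not results from the paper. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product
from typing import List, Sequence, Tuple


def signed_ranks(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, List[float]]:
    """Return (W_plus, W_minus, ranks) for paired samples a and b."""
    # Round to suppress floating-point noise, then drop zero differences.
    diffs = [d for d in (round(x - y, 9) for x, y in zip(a, b)) if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(ranks: Sequence[float], statistic: float) -> float:
    """Exact p-value: share of sign patterns with min(W+, W-) <= statistic."""
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_plus, total - w_plus) <= statistic:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical per-fold accuracies for two tuning methods (illustration only).
tpe_acc = [0.81, 0.83, 0.80, 0.85, 0.82, 0.84]
rs_acc = [0.80, 0.80, 0.82, 0.80, 0.82, 0.80]

w_plus, w_minus, ranks = signed_ranks(tpe_acc, rs_acc)
p_value = exact_two_sided_p(ranks, min(w_plus, w_minus))
print(f"W+={w_plus} W-={w_minus} n={len(ranks)} p={p_value:.4f}")
```

The exact enumeration visits 2^n sign patterns, so it is only practical for small n such as the 10 folds of a 10-fold cross validation; larger samples usually use a normal approximation.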


A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring

Corporate Financial Risk Identification and Operation Control Analysis for XGBoost Modeling

A Modified Bayesian Optimization based Hyper-Parameter Tuning Approach for Extreme Gradient Boosting

Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

XGBoost Optimized by Adaptive Particle Swarm Optimization for Credit Scoring

Predicting Chinese stock market using XGBoost multi-objective optimization with optimal weighting

Risk-Controlling Model Selection via Guided Bayesian Optimization

Bankruptcy Prediction using the XGBoost Algorithm and Variable Importance Feature Engineering

Advanced hyperparameter optimization for improved spatial prediction of shallow landslides using extreme gradient boosting (XGBoost)

Generalized XGBoost Method

An improved gradient boosting tree algorithm for financial risk management

A Hybrid XGBoost-MLP Model for Credit Risk Assessment on Digital Supply Chain Finance

Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

Prediction and analysis of train arrival delay based on XGBoost and Bayesian optimization

Penalized semiparametric Cox regression model on XGBoost and random survival forests

The impact of Bayesian optimization on feature selection

Design of Efficient Financial Big Data Processing and Analysis System Using Machine Learning Technology

Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization

Assessment of landslide susceptibility mapping based on Bayesian hyperparameter optimization: A comparison between logistic regression and random forest

CatBoost model with synthetic features in application to loan risk assessment of small businesses