Tri-XGBoost model improved by BLSmote-ENN: an interpretable semi-supervised approach for addressing bankruptcy prediction
Salima Smiti, Makram Soui, Khaled Ghedira
DOI: https://doi.org/10.1007/s10115-024-02067-w
IF: 2.7
2024-04-23
Knowledge and Information Systems
Abstract: Bankruptcy prediction is considered one of the most important research topics in finance and accounting. The rapid growth of data science, artificial intelligence, and machine learning has led researchers to build increasingly accurate bankruptcy prediction models. Recent studies show that ensemble methods perform better than traditional machine learning models for predicting corporate failure, especially on highly imbalanced datasets. However, the black-box nature of these techniques makes their results hard to interpret: they assign corporate classes without any explanation. To address this, we propose an accurate and interpretable classification model that outputs a set of prediction rules. The proposed method, Tri-eXtreme Gradient Boosting (Tri-XGBoost), is a semi-supervised technique that combines Borderline-Smote (BLSmote) followed by Edited Nearest Neighbor (ENN) resampling with three different XGBoost boosters as weak classifiers (gbtree, gblinear, and dart). First, the resampling techniques are used to produce more representative synthetic data and balance the class distribution of the datasets. Specifically, BLSmote is applied to increase the proportion of instances in the minority class (bankrupt data), and ENN is then used to eliminate noisy samples from both classes. In addition, the features that contribute most to predictive accuracy are selected using XGBoost. Finally, to make the model more understandable for both applicants and experts, the result is presented as "IF–THEN" rules. The proposed model is validated on the imbalanced Polish and Taiwan bankruptcy datasets. The results show that it outperforms existing models on the area under the ROC curve (AUC), F1-score, and G-mean. In terms of these three measures, it significantly improves classification performance, exceeding 95% on the Polish datasets and 93% on the Taiwanese dataset.
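The abstract reports F1-score and G-mean rather than plain accuracy because accuracy can look good on imbalanced data even when the bankrupt class is missed. The following is a minimal sketch of those two measures in plain Python, assuming the common definition of G-mean as the geometric mean of sensitivity and specificity (the abstract does not spell out its formula). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy labels: 1 = bankrupt (minority), 0 = non-bankrupt.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_all_negative = [0] * len(y_true)  # a classifier that never predicts bankruptcy

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, f1_score(tp, fp, fn), g_mean(tp, fp, tn, fn))

tp0, fp0, tn0, fn0 = confusion_counts(y_true, y_all_negative)
print((tp0 + tn0) / len(y_true), f1_score(tp0, fp0, fn0), g_mean(tp0, fp0, tn0, fn0))
```

On these toy labels both predictors reach the same accuracy, but the one that never predicts bankruptcy scores zero on F1 and G-mean, which is why such measures are preferred for imbalanced bankruptcy data.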
computer science, information systems, artificial intelligence