Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction

Zixue Zhao, Tianxiang Cui, Shusheng Ding, Jiawei Li, Anthony Graham Bellotti
DOI: https://doi.org/10.3390/math12050701
IF: 2.4
2024-02-29
Mathematics
Abstract: Credit risk prediction heavily relies on historical data provided by financial institutions. The goal is to identify commonalities among defaulting users based on existing information. However, data on defaulters is often limited, leading to a concentration of credit data where positive samples (defaults) are significantly fewer than negative samples (nondefaults). This poses a serious challenge known as the class imbalance problem, which can substantially impact data quality and predictive model effectiveness. To address the problem, various resampling techniques have been proposed and studied extensively. However, despite ongoing research, there is no consensus on the most effective technique. The choice of resampling technique is closely related to the dataset size and imbalance ratio, and its effectiveness varies across different classifiers. Moreover, there is a notable gap in research concerning suitable techniques for extremely imbalanced datasets. Therefore, this study aims to compare popular resampling techniques across different datasets and classifiers while also proposing a novel hybrid sampling method tailored for extremely imbalanced datasets. Our experimental results demonstrate that this new technique significantly enhances classifier predictive performance, shedding light on effective strategies for managing the class imbalance problem in credit risk prediction.
What problem does this paper attempt to address?
This paper primarily aims to address the issue of class imbalance (CI) in credit risk prediction. Specifically, credit risk prediction relies on historical data provided by financial institutions to identify common characteristics of defaulting users. However, in actual data, the number of default samples (positive samples) is much smaller than the number of non-default samples (negative samples), a phenomenon known as the class imbalance problem. The class imbalance problem can severely affect data quality and the effectiveness of prediction models. To solve this problem, researchers have proposed various resampling techniques and conducted extensive studies. Despite the large amount of research, there is still no consensus on the most effective technique. Additionally, there are gaps in existing research when it comes to extremely imbalanced datasets. Therefore, this study aims to:

1. Compare popular resampling techniques across different datasets and classifiers.
2. Propose a new hybrid resampling method (SH-SENN) specifically designed to handle extremely imbalanced datasets.

Experimental results show that this new method significantly improves the predictive performance of classifiers, providing an effective strategy for managing the class imbalance problem in credit risk prediction.
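
To make the terms "imbalance ratio" and "resampling" concrete, here is a minimal sketch of random oversampling, one of the simplest techniques in this family. It is not the paper's SH-SENN method, whose details are not given in this summary. The function names, the toy data, and the definition of imbalance ratio as majority count divided by minority count are assumptions made for illustration, and the sketch handles only the binary case.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Hypothetical helper for illustration; assumes binary labels and at least
    one minority sample.
    """
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    n_extra = max(majority_count - len(minority_rows), 0)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    # Return new lists so the original data is left untouched.
    return rows + extra, labels + [minority] * len(extra)


# Toy data: 2 defaults (label 1) among 10 borrowers; the features are made up.
rows = [[i, i % 3] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(imbalance_ratio(labels))           # 4.0
print(imbalance_ratio(balanced_labels))  # 1.0
```

In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keeps its original class distribution.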