Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction

Zixue Zhao, Tianxiang Cui, Shusheng Ding, Jiawei Li, Anthony Graham Bellotti

DOI: https://doi.org/10.3390/math12050701

IF: 2.4

2024-02-29

Mathematics

Abstract: Credit risk prediction heavily relies on historical data provided by financial institutions. The goal is to identify commonalities among defaulting users based on existing information. However, data on defaulters is often limited, leading to a concentration of credit data where positive samples (defaults) are significantly fewer than negative samples (nondefaults). This poses a serious challenge known as the class imbalance problem, which can substantially impact data quality and predictive model effectiveness. To address the problem, various resampling techniques have been proposed and studied extensively. However, despite ongoing research, there is no consensus on the most effective technique. The choice of resampling technique is closely related to the dataset size and imbalance ratio, and its effectiveness varies across different classifiers. Moreover, there is a notable gap in research concerning suitable techniques for extremely imbalanced datasets. Therefore, this study aims to compare popular resampling techniques across different datasets and classifiers while also proposing a novel hybrid sampling method tailored for extremely imbalanced datasets. Our experimental results demonstrate that this new technique significantly enhances classifier predictive performance, shedding light on effective strategies for managing the class imbalance problem in credit risk prediction.

What problem does this paper attempt to address?

This paper primarily aims to address the issue of class imbalance (CI) in credit risk prediction. Specifically, credit risk prediction relies on historical data provided by financial institutions to identify common characteristics of defaulting users. However, in actual data, the number of default samples (positive samples) is much smaller than the number of non-default samples (negative samples), a phenomenon known as the class imbalance problem. The class imbalance problem can severely affect data quality and the effectiveness of prediction models.

To solve this problem, researchers have proposed various resampling techniques and conducted extensive studies. Despite the large amount of research, there is still no consensus on the most effective technique. Additionally, there are gaps in existing research when it comes to extremely imbalanced datasets. Therefore, this study aims to:

1. Compare popular resampling techniques across different datasets and classifiers.
2. Propose a new hybrid resampling method (SH-SENN) specifically designed to handle extremely imbalanced datasets.

Experimental results show that this new method significantly improves the predictive performance of classifiers, providing an effective strategy for managing the class imbalance problem in credit risk prediction.
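To make the idea of hybrid resampling concrete, the following is a minimal sketch in plain Python. It is not the paper's SH-SENN method, whose details are not given on this page. It only illustrates the generic pattern of randomly undersampling the majority class (nondefaults) while randomly oversampling the minority class (defaults). The function names `imbalance_ratio` and `hybrid_random_resample`, the 0/1 label encoding, and the choice of the midpoint as the target class size are all assumptions made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def hybrid_random_resample(samples, labels, seed=0):
    """Naive hybrid resampling (illustrative only, not SH-SENN).

    Assumes binary labels where 1 = default (minority) and
    0 = nondefault (majority), with at least one default present.
    Both classes are brought to the midpoint of the two class sizes.
    """
    rng = random.Random(seed)
    defaults = [s for s, y in zip(samples, labels) if y == 1]
    nondefaults = [s for s, y in zip(samples, labels) if y == 0]

    # Assumed target size per class: halfway between the two counts.
    target = (len(defaults) + len(nondefaults)) // 2

    # Undersample the majority class without replacement.
    kept_nondefaults = rng.sample(nondefaults, target)

    # Oversample the minority class: keep every original default and
    # add random duplicates until the target size is reached.
    extra = [rng.choice(defaults) for _ in range(target - len(defaults))]

    new_samples = kept_nondefaults + defaults + extra
    new_labels = [0] * target + [1] * target
    return new_samples, new_labels


# Toy data: 5 defaults among 100 borrowers (hypothetical values).
samples = list(range(100))
labels = [1] * 5 + [0] * 95
print(imbalance_ratio(labels))            # 95 / 5 = 19.0

new_samples, new_labels = hybrid_random_resample(samples, labels)
print(imbalance_ratio(new_labels))        # balanced classes: 1.0
```

In practice, resampling of this kind would be applied to the training split only, so that the evaluation data keep the original class distribution.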

Impact of resampling methods and classification models on the imbalanced credit scoring problems

Application of Big Data Unbalanced Classification Algorithm in Credit Risk Analysis of Insurance Companies

Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data

Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending

Evaluating resampling methods on a real-life highly imbalanced online credit card payments dataset

A ResNet-LSTM Based Credit Scoring Approach for Imbalanced Data

Classification of Imbalanced Credit scoring data sets Based on Ensemble Method with the Weighted-Hybrid-Sampling

Value-Aware Resampling and Loss for Imbalanced Classification

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

A Novel Multi-Stage Ensemble Model With a Hybrid Genetic Algorithm for Credit Scoring on Imbalanced Data

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data

An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification

Unbalanced Credit Card Fraud Detection Data: A Machine Learning-Oriented Comparative Study of Balancing Techniques

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

A cluster impurity-based hybrid resampling for imbalanced classification problems

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

Credit risk assessment for unbalanced datasets based on data mining, artificial neural network and support vector machines

An empirical evaluation of sampling methods for the classification of imbalanced data