Abstract: In numerous classification problems, the class distribution is not balanced. For example, positive examples are rare in disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance the training data by oversampling the underrepresented classes (or undersampling the overrepresented classes) before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% for AUPRC and 10.0% for AUROC). Surprisingly, sampling was more likely to reduce than to improve classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
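
To make the quantities in the abstract concrete, the following is a minimal sketch in standard-library Python. It is not the study's pipeline. The function names (`auroc`, `auprc`, `random_oversample`, `paired_t_statistic`) and the toy data are hypothetical, and two choices are assumptions: average precision is used as the AUPRC estimator, and random oversampling stands in for the seven sampling methods, which the abstract does not name.

```python
import random
from math import sqrt
from statistics import mean, stdev


def auroc(y_true, scores):
    """AUROC as the chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(y_true, scores):
    """AUPRC estimated by average precision (assumed estimator; distinct scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits


def random_oversample(X, y, rng):
    """Duplicate random minority rows until both classes have equal counts."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] (needs non-constant differences)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


# Toy ranking: 2 positives among 20 examples, placed at ranks 3 and 5.
y_true = [1 if i in (2, 4) else 0 for i in range(20)]
scores = [(20 - i) / 20 for i in range(20)]
print(round(auroc(y_true, scores), 3))  # 0.861
print(round(auprc(y_true, scores), 3))  # 0.367

# Balance the toy training set; only training data would be resampled.
X = [[s] for s in scores]
X_bal, y_bal = random_oversample(X, y_true, random.Random(0))
print(sum(y_bal), len(y_bal) - sum(y_bal))  # 18 18

# Made-up per-run AUPRC values with and without sampling, for illustration only.
t = paired_t_statistic([0.5, 0.6, 0.7], [0.4, 0.4, 0.4])
```

On the toy ranking, the same scores give a high AUROC and a much lower AUPRC, which illustrates why the two measures can disagree when positives are rare. The sketch returns only the t statistic; turning it into a P value requires the t distribution with n − 1 degrees of freedom, for example through `scipy.stats.ttest_rel`.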

The data sampling effect on financial distress prediction by single and ensemble learning techniques

CUS-heterogeneous ensemble-based financial distress prediction for imbalanced dataset with ensemble feature selection

Financial Fraud Detection: a New Ensemble Learning Approach for Imbalanced Data

Financial distress prediction with optimal decision trees based on the optimal sampling probability

Resampling Techniques Study on Class Imbalance Problem in Credit Risk Prediction

Class‐imbalanced financial distress prediction with machine learning: Incorporating financial, management, textual, and social responsibility features into index system

Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting

Performance assessment of ensemble learning systems in financial data classification

A two-stage case-based reasoning driven classification paradigm for financial distress prediction with missing and imbalanced data

Bankruptcy prediction using optimal ensemble models under balanced and imbalanced data

Intelligent Model for Enhancing the Bankruptcy Prediction with Imbalanced Data Using Oversampling and CatBoost

The effect of feature extraction and data sampling on credit card fraud detection

Novel feature selection methods to financial distress prediction

Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

Dynamic forecasting of financial distress: the hybrid use of incremental bagging and genetic algorithm—empirical study of Chinese listed corporations

Empirical Analysis of Ensemble Learning for Imbalanced Credit Scoring Datasets: A Systematic Review

Classification of Imbalanced Credit scoring data sets Based on Ensemble Method with the Weighted-Hybrid-Sampling

Improving financial distress prediction using machine learning: A preliminary study

An empirical evaluation of sampling methods for the classification of imbalanced data

Influence of the Event Rate on Discrimination Abilities of Bankruptcy Prediction Models

Entropy-based Time-series Financial Distress Model Based on Attribute Selection and MetaCost Methods for Imbalance Class