Abstract: In many classification problems, the class distribution is imbalanced. For example, positive examples are rare in disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance the training data by oversampling the underrepresented classes (or undersampling the overrepresented ones) before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 combinations in total) using 31 datasets with varying degrees of imbalance. We used the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) as performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed classifier performance (paired t-tests, P < 0.05) in only a few cases (12.2% for AUPRC and 10.0% for AUROC). Surprisingly, sampling was more likely to reduce than to improve classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, sampling was not needed to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it can be ineffective or even harmful. Furthermore, the choice of performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
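
The evaluation ingredients named in the abstract (a sampling step, the two performance measures, and a paired t-test) can be illustrated with a minimal pure-Python sketch. This is not the study's code: the function names, the toy scores, and the use of average precision as the AUPRC estimate are assumptions made here for illustration, and random oversampling stands in for whichever seven sampling methods the study actually used.

```python
import math
import random

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision, one common estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / hits

def random_oversample(X, y, seed=0):
    """Duplicate random minority examples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def paired_t(a, b):
    """t statistic of the paired differences a_i - b_i (n - 1 degrees of freedom)."""
    d = [x - z for x, z in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Toy data (hypothetical): 2 positives among 10 examples, ranked 2nd and 4th.
y = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(auroc(y, scores))  # 0.8125
print(auprc(y, scores))  # 0.5

# Balance a training set before fitting a classifier (the classifier is omitted here).
X_bal, y_bal = random_oversample(list(range(10)), y)
print(y_bal.count(0), y_bal.count(1))  # 8 8
```

On this toy ranking the AUROC looks respectable while the AUPRC is much lower, which is consistent with the abstract's remark that the two measures can disagree on imbalanced data. `paired_t` returns only the test statistic; a P value would additionally require the t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.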
