Abstract: In numerous classification problems, the class distribution is not balanced. For example, positive examples are rare in disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance the training data by oversampling the underrepresented classes (or undersampling the overrepresented ones) before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed classifier performance (paired t-tests, P < 0.05) in only a few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce than to improve classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it can be ineffective or even harmful. Furthermore, the choice of performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
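To make the two ingredients of the abstract concrete, the following is a minimal sketch of random oversampling and of the two performance measures, written in plain Python. It is not the study's code: the function names (random_oversample, auroc, average_precision) and the toy scores are hypothetical, the oversampler is only one simple member of the sampling family, and average precision is assumed here as the AUPRC estimator (the study may have used a different one). The sketch also assumes binary labels coded as 0 and 1 with no tied scores.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes are the same size.

    Intended for training data only; test data keeps its original imbalance.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (an AUPRC estimate)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy example: 2 positives among 10 samples.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]

roc_value = auroc(y_true, scores)
pr_value = average_precision(y_true, scores)

# Balance a (hypothetical) training set with one feature per row.
X_train = [[s] for s in scores]
X_bal, y_bal = random_oversample(X_train, y_true)
```

On this toy ranking the two measures disagree in magnitude: two high-scoring negatives barely affect the pairwise ranking measure but halve the precision at the second positive, which illustrates why the choice of measure can change the conclusion drawn about a sampling method.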

Relabeling & raking algorithm for imbalanced classification

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

A Novel Hybrid Sampling Framework for Imbalanced Learning

An empirical evaluation of sampling methods for the classification of imbalanced data

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

A Density-based Under-sampling Algorithm for Imbalance Classification

The imbalance problem: A comparison of sampling approaches using different parameters and feature selection methods in the context of classification

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

Handling Imbalanced Data: A Case Study for Binary Class Problems

A cluster impurity-based hybrid resampling for imbalanced classification problems

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

An Improving Majority Weighted Minority Oversampling Technique for Imbalanced Classification Problem

Noise-free sampling with majority framework for an imbalanced classification problem

Class Imbalance Problem: A Wrapper-Based Approach using Under-Sampling with Ensemble Learning