Abstract: In numerous classification problems, the class distribution is not balanced. For example, positive examples are rare in disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance the training data by oversampling the underrepresented classes (or undersampling the overrepresented ones) before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce than to improve classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Sampling was also more effective at improving linear classifiers than other classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it can be ineffective or even harmful. Furthermore, the choice of performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
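To make the two ingredients of the abstract concrete, the following is a minimal sketch, not the study's implementation. It assumes plain random over- and undersampling (the abstract does not name the seven sampling methods), binary labels coded 0/1, and scores without ties. The function names `random_resample`, `auroc`, and `average_precision` are hypothetical, and average precision is used here as a step-wise estimate of the AUPRC.

```python
import random
from collections import Counter


def random_resample(X, y, mode="over", seed=0):
    """Balance a binary dataset by random over- or undersampling (illustrative only).

    mode="over":  duplicate minority rows (with replacement) up to the majority count.
    mode="under": keep a random subset of majority rows equal to the minority count.
    """
    rng = random.Random(seed)
    (maj, n_maj), (mino, n_min) = Counter(y).most_common(2)
    idx_maj = [i for i, label in enumerate(y) if label == maj]
    idx_min = [i for i, label in enumerate(y) if label == mino]
    if mode == "over":
        idx_min = idx_min + rng.choices(idx_min, k=n_maj - n_min)
    elif mode == "under":
        idx_maj = rng.sample(idx_maj, n_min)
    else:
        raise ValueError("mode must be 'over' or 'under'")
    idx = idx_maj + idx_min
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank
    return total / tp


# Toy data: 8 negatives and 2 positives.
X = list(range(10))
y = [0] * 8 + [1] * 2
X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")

# Toy ranking: one positive among ten examples, ranked second by the scores.
y_true = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
roc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
print(Counter(y_over), Counter(y_under), round(roc, 3), round(ap, 3))
```

In the toy ranking, a single negative scored above the only positive leaves the AUROC at 8/9 but halves the average precision to 0.5, which illustrates why the two measures can disagree on imbalanced data. In an evaluation like the one described, resampling would be applied to the training split only, so that the test set keeps its original class ratio.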
