Balanced Split: A new train-test data splitting strategy for imbalanced datasets

Azal Ahmad Khan
DOI: https://doi.org/10.48550/arXiv.2212.11116
2022-12-17
Abstract: Classification datasets with skewed class proportions are called imbalanced. Class imbalance is a problem since most machine learning classification algorithms are built with an assumption of equal representation of all classes in the training dataset. Therefore, to counter the class imbalance problem, many algorithm-level and data-level approaches have been developed. These mainly include ensemble learning and data augmentation techniques. This paper shows a new way to counter the class imbalance problem through a new data-splitting strategy called balanced split. Data splitting can play an important role in correctly classifying imbalanced datasets. We show that the commonly used data-splitting strategies have some disadvantages, and that our proposed balanced split solves those problems.
Machine Learning
What problem does this paper attempt to address?
The paper addresses shortcomings of existing data-splitting strategies when they are applied to imbalanced datasets. Specifically, it points out two problems:

1. **No guarantee for the minority class in the training set**: According to the paper, random or stratified splitting cannot guarantee that minority-class samples appear in the training set, which can weaken the model's ability to recognize the minority class.
2. **Imbalance persists in the training set**: Stratified splitting only keeps the class proportions consistent between the training set and the test set. It does not equalize the number of samples per class, so the training set remains as imbalanced as the full dataset.

To solve these problems, the paper proposes a new data-splitting strategy, **Balanced Split**. Its main principle is to ensure that every class contributes the same number of samples to the training set, which addresses class imbalance at the training stage. The paper supports this with theoretical analysis and experiments, reporting improved performance across multiple classification algorithms.
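The core idea, equal per-class counts in the training set, can be illustrated with a minimal sketch. This is not the paper's reference implementation: the function name `balanced_split`, its parameters, and the rule of sizing each class's training share from a fraction of the rarest class (`minority_train_frac`) are assumptions made for illustration only.

```python
import numpy as np

def balanced_split(X, y, train_per_class=None, minority_train_frac=0.8, seed=0):
    """Hypothetical sketch: draw the same number of training samples from every class.

    All remaining samples go to the test set, which therefore stays imbalanced.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)

    if train_per_class is None:
        # Assumption: size the per-class training share from the rarest class.
        train_per_class = max(1, int(minority_train_frac * counts.min()))
    if train_per_class >= counts.min():
        raise ValueError("train_per_class must leave at least one test sample per class")

    # Sample the same number of indices, without replacement, from each class.
    train_parts = []
    for c in classes:
        class_idx = np.flatnonzero(y == c)
        train_parts.append(rng.choice(class_idx, size=train_per_class, replace=False))
    train_idx = np.concatenate(train_parts)

    # Everything not chosen for training becomes test data.
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False
    test_idx = np.flatnonzero(test_mask)

    rng.shuffle(train_idx)  # avoid class-ordered training data
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 90 majority-class samples and 10 minority-class samples.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)
X_train, X_test, y_train, y_test = balanced_split(X, y)
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

In this sketch the training set is balanced by construction, while the test set absorbs all leftover majority-class samples. Accuracy on such a test set can therefore be dominated by the majority class, so per-class metrics are a safer way to read the results.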