GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

Vishwa Karia, Wenhao Zhang, Arash Naeim, Ramin Ramezani
DOI: https://doi.org/10.48550/arXiv.1910.10806
2019-10-24
Abstract: Imbalanced datasets are ubiquitous. Classification performance on imbalanced datasets is generally poor for the minority class as the classifier cannot learn decision boundaries well. However, in sensitive applications like fraud detection, medical diagnosis, and spam identification, it is extremely important to classify the minority instances correctly. In this paper, we present a novel technique based on genetic algorithms, GenSample, for oversampling the minority class in imbalanced datasets. GenSample decides the rate of oversampling a minority example by taking into account the difficulty in learning that example, along with the performance improvement achieved by oversampling it. This technique terminates the oversampling process when the performance of the classifier begins to deteriorate. Consequently, it produces synthetic data only as long as a performance boost is obtained. The algorithm was tested on 9 real-world imbalanced datasets of varying sizes and imbalance ratios. It achieved the highest F-Score on 8 out of 9 datasets, confirming its ability to better handle imbalanced data compared to other existing methodologies.
Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses classification on imbalanced datasets, where minority class samples are far fewer than majority class samples. The imbalance makes it difficult for a classifier to learn the decision boundary of the minority class, which degrades classification performance on minority class samples. In sensitive applications such as fraud detection, medical diagnosis, and spam identification, however, correctly classifying minority class instances is extremely important.

To this end, the authors propose GenSample, a new technique based on genetic algorithms for oversampling the minority class in imbalanced datasets. GenSample determines how much to oversample each minority class sample by considering both how difficult that sample is to learn and the performance improvement obtained by oversampling it, and it terminates the oversampling process when classifier performance starts to decline. Synthetic data is therefore generated only as long as it improves performance, which avoids the degradation caused by excessive oversampling.

### Main Contributions

1. **Genetic Algorithm-Based Oversampling Method**: GenSample combines the selection, crossover, and mutation operations of genetic algorithms to iteratively learn which minority class samples are most suitable for oversampling.
2. **Adaptive Termination Condition**: The algorithm terminates automatically once further oversampling degrades performance, so the overall performance of the classifier is not reduced by excessive oversampling.
3. **Experimental Evidence**: Experiments were conducted on 9 real-world datasets, and GenSample achieved the highest F1 score on 8 of them, demonstrating its effectiveness in handling imbalanced datasets.
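The loop described above (generate candidates by crossover and mutation, select by classifier performance, stop when performance drops) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the k-nearest-neighbour stand-in classifier, the use of a held-out validation set as the fitness signal, and all parameter values are assumptions made for illustration, and the per-sample difficulty weighting mentioned in the summary is omitted.

```python
import random


def f1_minority(y_true, y_pred):
    """F1 score for the minority class, which is labeled 1 in this sketch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def knn_predict(train_X, train_y, x, k=3):
    """Stand-in classifier (assumption): vote among the k nearest points."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    nearest = order[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def fitness(train_X, train_y, val_X, val_y):
    """Fitness (assumption): minority-class F1 on a held-out validation set."""
    preds = [knn_predict(train_X, train_y, x) for x in val_X]
    return f1_minority(val_y, preds)


def oversample(train_X, train_y, val_X, val_y,
               max_rounds=30, pop_size=8, sigma=0.1, seed=0):
    """Add one synthetic minority point per round until fitness drops."""
    rng = random.Random(seed)
    X, y = list(train_X), list(train_y)
    # Parents are the original minority points (at least two are required).
    parents = [x for x, label in zip(X, y) if label == 1]
    best = fitness(X, y, val_X, val_y)
    for _round in range(max_rounds):
        scored = []
        for _candidate in range(pop_size):
            a, b = rng.sample(parents, 2)
            alpha = rng.random()
            # Crossover: interpolate between two parents.
            # Mutation: add Gaussian noise to every coordinate.
            child = tuple(ai + alpha * (bi - ai) + rng.gauss(0.0, sigma)
                          for ai, bi in zip(a, b))
            score = fitness(X + [child], y + [1], val_X, val_y)
            scored.append((score, child))
        # Selection: keep the candidate with the highest fitness.
        score, child = max(scored, key=lambda item: item[0])
        if score < best:
            break  # Terminate once the best candidate makes performance worse.
        X.append(child)
        y.append(1)
        best = score
    return X, y, best
```

Calling `oversample(train_X, train_y, val_X, val_y)` with points given as tuples of floats and labels as 0 (majority) or 1 (minority) returns the augmented training set together with the final validation F1. Because a candidate is accepted only when it does not lower the score, the returned F1 is never below the baseline on that validation set.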
### Experimental Results

- **Overall Accuracy**: GenSample outperformed other benchmark algorithms in terms of overall accuracy across all datasets, indicating that the algorithm improves minority class performance without sacrificing the accuracy of the majority class.
- **F1 Score**: GenSample achieved the highest F1 score on 8 datasets, indicating its excellent performance in balancing precision and recall.
- **Precision and Recall**: GenSample showed the best precision performance on 8 datasets and also had good recall performance.
- **Geometric Mean**: GenSample achieved the highest geometric mean on more than 50% of the datasets, further validating its robustness in handling imbalanced datasets.

### Conclusion and Future Work

The paper proposes GenSample, a genetic algorithm-based method for handling imbalanced datasets. Experimental results show that GenSample can effectively improve classification performance on most datasets, especially in terms of F1 score and precision. Future research directions include exploring other heuristic methods to further improve performance and combining GenSample with ensemble methods to achieve better results.
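For reference, the metrics named under Experimental Results can be computed from confusion-matrix counts. The sketch below is illustrative only. The function name `imbalance_metrics` is hypothetical, and the geometric mean is taken as the square root of recall times specificity, a common definition in imbalanced learning that this summary does not itself spell out.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Compute common imbalanced-classification metrics from raw counts.

    tp/fn count minority (positive) samples, tn/fp count majority samples.
    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumed definition: sqrt(recall * specificity).
        "g_mean": math.sqrt(recall * specificity),
    }
```

A classifier that always predicts the majority class on 90 majority and 10 minority samples (`imbalance_metrics(0, 0, 10, 90)`) reaches an accuracy of 0.9 while its F1 score and geometric mean are both 0, which is why F1 and geometric mean are reported alongside accuracy for imbalanced data.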