An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification
Ming Zheng,Tong Li,Liping Sun,Taochun Wang,Biao Jie,Weiyi Yang,Mingjing Tang,Changlong Lv
DOI: https://doi.org/10.1016/j.knosys.2021.106800
IF: 8.139
2021-01-01
Knowledge-Based Systems
Abstract: Imbalanced data are a common phenomenon in both theoretical research and real-world applications. At a data level, standard classification algorithms cannot effectively learn and make predictions from imbalanced data, and this problem is generally solved by using oversampling, undersampling, or hybrid sampling methods. However, most of the current sampling methods use random sampling ratios, and the resulting classification performance can be undesirable and unstable. To obtain satisfactory and stable classification performance, we proposed three algorithms to automatically determine the sampling ratios for oversampling, undersampling, and hybrid sampling methods, based on a genetic algorithm. Experiments were performed to test the algorithms' effectiveness by utilizing five widely used standard classification algorithms on 14 different imbalanced datasets using two oversampling, two undersampling, and four hybrid sampling methods. The statistical test results showed that for all five standard classification algorithms, sampling methods that used our proposed algorithms achieved the best classification results. Using area under the receiver operating characteristic curve (AUC) as the evaluation metric, it was demonstrated that the proposed algorithms for automatically determining the sampling ratio outperformed the random sampling ratio. © 2021 Elsevier B.V. All rights reserved.
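The abstract does not give the details of the three algorithms, so the following is only a minimal sketch of the general idea: a genetic algorithm searching for a single sampling ratio that maximizes a fitness score such as AUC. Everything here is an assumption made for illustration, not the authors' method: the function name `ga_search_ratio`, the real-valued encoding, tournament selection, blend crossover, Gaussian mutation, and all default parameter values. The `fitness` callable is hypothetical; in practice it would resample the training data at the given ratio, train a classifier, and return the validation AUC. A hybrid sampling method would presumably need more than one ratio per individual, which this sketch does not cover.

```python
import random
from typing import Callable, Tuple


def ga_search_ratio(
    fitness: Callable[[float], float],
    low: float = 0.1,
    high: float = 1.0,
    pop_size: int = 20,
    generations: int = 30,
    crossover_rate: float = 0.8,
    mutation_rate: float = 0.2,
    mutation_scale: float = 0.1,
    seed: int = 0,
) -> Tuple[float, float]:
    """Search [low, high] for the ratio with the highest fitness.

    Illustrative only: the operators and defaults are assumptions,
    not taken from the paper.
    """
    rng = random.Random(seed)

    def clip(r: float) -> float:
        return min(high, max(low, r))

    # Each individual is one candidate sampling ratio.
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    best_ratio, best_score = population[0], float("-inf")

    for _ in range(generations):
        scores = [fitness(r) for r in population]

        # Track the best individual evaluated so far.
        for r, s in zip(population, scores):
            if s > best_score:
                best_ratio, best_score = r, s

        def select() -> float:
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        # Elitism: carry the best ratio into the next generation unchanged.
        next_population = [best_ratio]
        while len(next_population) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                w = rng.random()
                child = w * p1 + (1.0 - w) * p2  # blend crossover
            else:
                child = p1
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, mutation_scale * (high - low))
            next_population.append(clip(child))
        population = next_population

    return best_ratio, best_score


def toy_auc(ratio: float) -> float:
    # Hypothetical stand-in for "resample at `ratio`, train, return AUC".
    # It peaks at ratio = 0.7 purely for demonstration.
    return 0.9 - (ratio - 0.7) ** 2


if __name__ == "__main__":
    ratio, score = ga_search_ratio(toy_auc)
    print(f"best ratio: {ratio:.3f}, fitness: {score:.4f}")
```

Because the fitness function is passed in as a callable, the same search loop could in principle be reused with any sampler and classifier pair. Each fitness evaluation in a real setting involves resampling and training a model, so `pop_size` and `generations` directly control the total cost.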