A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data

Zhaozhao Xu, Derong Shen, Tiezheng Nie, Yue Kou, Nan Yin, Xi Han
DOI: https://doi.org/10.1016/j.ins.2021.02.056
IF: 8.1
2021-01-01
Information Sciences
Abstract: The C4.5 decision tree algorithm offers high classification accuracy, fast computation, and comprehensible classification rules, so it is widely used for medical data analysis. However, on imbalanced medical data, the classification accuracy of decision-tree-based models is not ideal. This paper therefore proposes a cluster-based oversampling algorithm (KNSMOTE) that combines the synthetic minority oversampling technique (SMOTE) with the k-means algorithm. The classes assigned by k-means clustering are compared with the original sample classes to select the "safe samples", whose classes have not changed. New samples are synthesized by linear interpolation between the "safe samples". The improved SMOTE sets the oversampling ratio according to the imbalance ratio of the original samples, so that the number of synthesized samples is the same as that of the original samples. Compared with other oversampling algorithms on 8 UCI data sets, our algorithm achieved significant advantages. Applied to the medical datasets, it yielded average Sensitivity and Specificity values of 99.84% and 99.56%, respectively, for the Random forest (RF) algorithm. (c) 2021 Elsevier Inc. All rights reserved.
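
The pipeline described in the abstract (cluster, keep the "safe samples", interpolate) can be illustrated with a minimal pure-Python sketch. It is not the authors' KNSMOTE implementation. The function names (kmeans, safe_samples, oversample) are hypothetical, and several details are assumptions the abstract does not confirm: k equals 2, each cluster is mapped to the most common original class among its members, interpolation uses random pairs of safe minority samples rather than nearest neighbours, and the number of synthetic samples is chosen to balance the two classes.

```python
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # seeded random initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def safe_samples(points, labels, k=2, seed=0):
    """Indices of samples whose cluster-derived class matches the original label.

    Assumption: a cluster's class is the most common original label inside it.
    """
    assign = kmeans(points, k, seed=seed)
    cluster_class = {
        c: Counter(l for l, a in zip(labels, assign) if a == c).most_common(1)[0][0]
        for c in set(assign)
    }
    return [i for i, (l, a) in enumerate(zip(labels, assign))
            if cluster_class[a] == l]


def oversample(points, labels, minority, seed=0):
    """Synthesize minority samples by linear interpolation between safe samples.

    Assumption: enough samples are created to match the majority class size.
    """
    rng = random.Random(seed)
    safe = [points[i] for i in safe_samples(points, labels, seed=seed)
            if labels[i] == minority]
    if not safe:
        return []  # nothing safe to interpolate from
    counts = Counter(labels)
    n_new = max(counts.values()) - counts[minority]
    synthetic = []
    for _ in range(n_new):
        a, b = rng.choice(safe), rng.choice(safe)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Calling oversample(points, labels, minority=1) on a list of feature vectors and a parallel list of class labels returns only the synthetic minority samples, which the caller would append to the training set before fitting a classifier such as C4.5 or a random forest.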