Research on Expansion and Classification of Imbalanced Data Based on SMOTE Algorithm

Shujuan Wang, Yuntao Dai, Jihong Shen, Jingxue Xuan
DOI: https://doi.org/10.21203/rs.3.rs-800351/v1
2021-01-01
Abstract: With the development of artificial intelligence, medical auxiliary diagnosis based on big-data classification is regarded as a promising new technology. Because samples are collected under different conditions, medical big data is often imbalanced. Class imbalance has been reported to severely hinder the classification performance of many standard learning algorithms and has attracted considerable attention from researchers in different fields. To address this problem, this paper proposes an improved SMOTE algorithm based on the Normal distribution. Normally distributed random values are used to expand the minority class, so that new sample points fall closer to the center of the minority class with higher probability. In addition, the distribution of the generated data is controlled through the characteristics of the Normal distribution, and the influence of the statistical characteristics of the original data on the selection of the parameter (variance) is analyzed based on inter-class distance and sample variance. Experiments show that, measured by AUC, OOB, F-value, and G-value, the proposed algorithm achieves better classification results than the original SMOTE algorithm on the Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-wisconsin imbalanced datasets.
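
The abstract does not give the generation formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each synthetic point is placed on the line from a randomly chosen minority sample toward the minority-class centroid, with a step fraction drawn from N(1, sigma^2), so that points cluster around the center and sigma controls their spread. The function name `normal_smote`, the default `sigma=0.3`, and the choice of mean 1 are all hypothetical.

```python
import numpy as np


def normal_smote(minority, n_new, sigma=0.3, seed=None):
    """Sketch of Normal-distribution-based oversampling (assumed form).

    minority : array of shape (n_samples, n_features), minority-class points
    n_new    : number of synthetic points to generate
    sigma    : standard deviation of the step fraction (assumed parameter)
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Centroid of the minority class.
    center = minority.mean(axis=0)

    # Pick a random minority sample as the starting point of each new point.
    idx = rng.integers(0, len(minority), size=n_new)

    # Step fraction t ~ N(1, sigma^2): t = 1 lands exactly on the centroid,
    # so most new points fall near the center (assumption, see lead-in).
    t = rng.normal(loc=1.0, scale=sigma, size=(n_new, 1))

    return minority[idx] + t * (center - minority[idx])


# Usage: expand a toy minority class of 20 points with 50 synthetic points.
toy_minority = np.random.default_rng(1).normal(size=(20, 3))
synthetic = normal_smote(toy_minority, n_new=50, sigma=0.3, seed=0)
```

In this sketch a smaller `sigma` concentrates the synthetic points more tightly around the centroid, while a larger value spreads them out; this is the kind of variance choice the abstract says is analyzed against inter-class distance and sample variance.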