An Imbalanced Data Classification Method Based on Automatic Clustering Under-Sampling

Xiaoheng Deng, Weijian Zhong, Ju Ren, Detian Zeng, Honggang Zhang
DOI: https://doi.org/10.1109/pccc.2016.7820640
2016-01-01
Abstract: Classification of imbalanced datasets has become one of the most challenging problems in big data mining. Because positive samples are far fewer than negative samples, traditional learning algorithms suffer from low accuracy, poor generalization performance, and other defects. Ensemble construction is an important approach to this problem. In particular, ensemble construction based on random under-sampling or on clustering can effectively improve classification performance. However, the former easily loses information and the latter increases complexity. In this paper, we propose ACUS, an improved ensemble algorithm based on automatic clustering and under-sampling. ACUS first clusters the samples according to their weights, and then constructs a balanced dataset consisting of a certain percentage of the majority-class samples and all of the minority-class samples from each cluster. These datasets are then used with the AdaBoost algorithm to build an ensemble classifier. Experimental results demonstrate the advantages of the proposed algorithm in terms of accuracy, simplicity, and stability.
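The dataset-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_balanced_subset`, the `ratio` parameter, and the convention that label 1 marks the minority class are assumptions made for illustration. Cluster assignments are taken as given, so the weight-based automatic clustering and the AdaBoost training loop are not reproduced here.

```python
import random
from collections import defaultdict


def build_balanced_subset(samples, labels, clusters, ratio, seed=0):
    """Hypothetical helper: keep every minority sample (label 1) and a
    `ratio` fraction of the majority samples (label 0) from each cluster."""
    rng = random.Random(seed)
    majority_by_cluster = defaultdict(list)
    chosen = []
    for idx, (label, cluster) in enumerate(zip(labels, clusters)):
        if label == 1:
            chosen.append(idx)  # minority class: always kept
        else:
            majority_by_cluster[cluster].append(idx)  # group majority by cluster
    for cluster in sorted(majority_by_cluster):
        members = majority_by_cluster[cluster]
        # Draw at least one majority sample so no cluster is dropped entirely.
        k = max(1, round(ratio * len(members)))
        chosen.extend(rng.sample(members, k))
    chosen.sort()
    return [samples[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 4 minority samples and 20 majority samples in two clusters.
samples = list(range(24))
labels = [1] * 4 + [0] * 20
clusters = [0, 1, 0, 1] + [0] * 10 + [1] * 10
subset_x, subset_y = build_balanced_subset(samples, labels, clusters, ratio=0.2)
```

With these toy inputs, each cluster contributes 2 of its 10 majority samples, so the subset holds 4 majority and 4 minority samples. Sampling inside each cluster, rather than across the whole majority class, is what keeps every region of the majority distribution represented after under-sampling.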