Self-paced Ensemble for Highly Imbalanced Massive Data Classification

Zhining Liu, Wei Cao, Zhifeng Gao, Jiang Bian, Hechang Chen, Yi Chang, Tie-Yan Liu
DOI: https://doi.org/10.1109/icde48307.2020.00078
2019-01-01
Abstract: Many real-world applications reveal difficulties in learning classifiers from imbalanced data. The rising big data era has been witnessing more classification tasks with large-scale but extremely imbalanced and low-quality datasets. Most existing learning methods suffer from poor performance or low computational efficiency under such a scenario. To tackle this problem, we conduct deep investigations into the nature of class imbalance, which reveal that not only the disproportion between classes but also other difficulties embedded in the nature of the data, especially noise and class overlapping, prevent us from learning effective classifiers. Taking those factors into consideration, we propose a novel framework for imbalanced classification that aims to generate a strong ensemble by self-paced harmonizing of data hardness via under-sampling. Extensive experiments have shown that this new framework, while being very computationally efficient, can lead to robust performance even under highly overlapping classes and extremely skewed distributions. Note that our methods can be easily adapted to most existing learning methods (e.g., C4.5, SVM, GBDT, and Neural Network) to boost their performance on imbalanced data.
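The abstract describes the framework only at a high level, so the following is a minimal sketch of one way to build an ensemble by hardness-aware under-sampling, not the authors' implementation. Everything concrete here is an assumption made for illustration: the toy nearest-centroid base learner, the use of the current ensemble's score as the "hardness" of a majority sample, the binning, the `tan`-based self-paced schedule, and all function and parameter names.

```python
import math
import random


def fit_centroids(X, y):
    """Toy base learner (assumption): store one mean vector per class."""
    dims = len(X[0])
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return cents


def member_score(cents, x):
    """Score in (0, 1); higher when x is closer to the class-1 centroid."""
    d0 = math.dist(x, cents[0])
    d1 = math.dist(x, cents[1])
    return 1.0 / (1.0 + math.exp(d1 - d0))


def ensemble_score(members, x):
    """Average the scores of all members trained so far."""
    return sum(member_score(m, x) for m in members) / len(members)


def self_paced_ensemble(X, y, n_members=5, n_bins=5, seed=0):
    """Hypothetical sketch: label 1 is the minority class, label 0 the majority."""
    rng = random.Random(seed)
    minority = [x for x, t in zip(X, y) if t == 1]
    majority = [x for x, t in zip(X, y) if t == 0]
    k = len(minority)

    def train(subset):
        data = minority + subset
        labels = [1] * k + [0] * len(subset)
        return fit_centroids(data, labels)

    # First member: plain random under-sampling to a balanced set.
    members = [train(rng.sample(majority, k))]

    for i in range(1, n_members):
        # Hardness of a majority sample (true label 0) = current ensemble score.
        bins = [[] for _ in range(n_bins)]
        for x in majority:
            h = ensemble_score(members, x)
            bins[min(int(h * n_bins), n_bins - 1)].append((x, h))

        # Assumed self-paced factor: grows with i, which flattens the bin
        # weights and so shifts focus away from the many easy samples.
        alpha = math.tan(math.pi * i / (2 * n_members))
        weights = [
            1.0 / (sum(h for _, h in b) / len(b) + alpha) if b else 0.0
            for b in bins
        ]
        total = sum(weights)

        # Draw from each bin in proportion to its weight, capped by bin size.
        subset = []
        for b, w in zip(bins, weights):
            quota = min(len(b), round(k * w / total))
            subset += [x for x, _ in rng.sample(b, quota)]
        if not subset:  # Fallback so the base learner always sees class 0.
            subset = rng.sample(majority, k)

        members.append(train(subset))
    return members
```

Under these assumptions, `ensemble_score(self_paced_ensemble(X, y), x)` returns a minority-class score for a new point `x`. Any base learner that outputs a score in [0, 1] could replace the centroid model, which matches the abstract's remark that the approach can wrap most existing learners.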