Sample and feature selecting based ensemble learning for imbalanced problems
Zhe Wang, Peng Jia, Xinlei Xu, Bolu Wang, Yujin Zhu, Dongdong Li
DOI: https://doi.org/10.1016/j.asoc.2021.107884
IF: 8.7
2021-12-01
Applied Soft Computing
Abstract: The imbalanced problem concerns the performance of classifiers on data sets with a severely imbalanced class distribution. Traditional methods are misled by the majority samples into making incorrect predictions and fail to make full use of the minority samples. This paper proposes a novel hybrid ensemble learning strategy named Sample and Feature Selection Hybrid Ensemble Learning (SFSHEL) and combines it with random forest to improve classification performance on imbalanced data. Specifically, SFSHEL uses cluster-based stratification to undersample the majority samples and, simultaneously, adopts a sliding window mechanism to generate diverse feature subsets. Weights trained through validation are then assigned to the different base learners, and SFSHEL makes its final prediction by weighted voting. In this manner, SFSHEL not only guarantees acceptable performance but also saves computational time. Furthermore, the weighting process lets SFSHEL interpret the importance of each selected feature set, which is important in real-world scenarios. The contributions of the proposed strategy are: (1) reducing the impact of the imbalanced class distribution, (2) assigning base learner weights only once, after the training process, and (3) generating feature weights that help interpret the importance of clinical features. In practice, random forest is adopted as the base learner for SFSHEL, yielding a classifier abbreviated as SFSHEL-RF. Experiments show that the average performance of the proposed SFSHEL-RF on a subset of the KEEL datasets reaches 91.37%, which is comparable to our previous best method, ECUBoost-RF, and higher than that of the other eleven methods. On the clinical heart failure datasets, SFSHEL-RF stably ranks among the top three on three indicators. The experimental results on both the standard imbalanced and clinical heart failure datasets validate the effectiveness and stability of SFSHEL-RF.
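Two of the steps named in the abstract, sliding-window feature subsets and weighted voting over base learners, can be illustrated with a minimal sketch. The abstract does not state the window width, the stride, or how the validation weights are computed, so the function names `sliding_feature_windows` and `weighted_vote`, their parameters, and the example weights below are all assumptions made for illustration, not the authors' implementation. The cluster-based undersampling step and the random forest base learners are omitted.

```python
from collections import defaultdict


def sliding_feature_windows(n_features, width, stride):
    """Return overlapping lists of feature indices.

    `width` and `stride` are hypothetical parameters; the abstract
    does not specify how the windows are sized or moved.
    """
    if width <= 0 or stride <= 0 or width > n_features:
        raise ValueError("need 0 < width <= n_features and stride > 0")
    starts = range(0, n_features - width + 1, stride)
    windows = [list(range(s, s + width)) for s in starts]
    # Add a final window so the trailing features are always covered.
    last_start = n_features - width
    if windows[-1][0] != last_start:
        windows.append(list(range(last_start, n_features)))
    return windows


def weighted_vote(predictions, weights):
    """Combine one predicted label per base learner using fixed weights.

    The weights are assumed to come from a validation step performed
    once after training; how they are derived is not shown here.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per prediction is required")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # The label with the largest accumulated weight wins.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Six features, windows of three features, moved two features at a time.
    print(sliding_feature_windows(6, 3, 2))
    # Three base learners, one per window, with made-up validation weights.
    print(weighted_vote([1, 0, 1], [0.2, 0.7, 0.3]))
```

In this sketch each window would feed one base learner, so a learner's weight can also be read as a rough importance score for its feature subset, which is the interpretability idea the abstract describes.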
computer science, artificial intelligence, interdisciplinary applications