An Overlapping Oriented Imbalanced Ensemble Learning Algorithm with Weighted Projection Clustering Grouping and Consistent Fuzzy Sample Transformation

Fan Li, Bo Wang, Yinghua Shen, Pin Wang, Yongming Li
DOI: https://doi.org/10.1016/j.ins.2023.118955
IF: 8.1
2023-01-01
Information Sciences
Abstract: Class imbalance and class overlapping can occur simultaneously in imbalanced learning. However, most existing algorithms focus mainly on the former. Although some recent algorithms address the class overlapping problem, they do not identify the overlapping region effectively, which causes a loss of sample information, and they are applied directly to the original, low-quality samples. To address these problems, this paper proposes an imbalanced ensemble learning algorithm based on weighted projection clustering grouping and consistent fuzzy sample transformation (PCGDST-IE). Firstly, a weighted projection clustering combination framework (WPCC), guided by the Davies-Bouldin clustering effectiveness index (DBI), is designed to obtain high-quality clusters, and the clusters are combined to form cross-complete subsets (CCS) with low overlapping. Secondly, a stage-wise hybrid sampling (SHS) algorithm is designed to de-overlap and balance the subsets. Finally, a local–global structure consistency mechanism (LGSCM) is constructed using fuzzy clustering and domain adaptation, thereby reducing class overlapping and improving the quality of the samples in the subsets. Weak classifiers are then trained on the balanced subsets and fused. More than 30 public datasets and over ten representative algorithms are chosen to verify the proposed method. The experimental results show that PCGDST-IE is significantly better in terms of anti-overlapping, Recall, F1-M, G-M, AUC, and diversity. The main contributions of the paper are: (a) proposing the WPCC to realize weighted projection clustering for subset generation; (b) proposing the SHS to better handle class imbalance and class overlapping; (c) proposing the LGSCM for sample transformation to improve the quality of the subsets; and (d) forming an imbalanced ensemble learning algorithm that better solves the class imbalance and class overlapping problems simultaneously.
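The abstract says the clustering step is guided by the Davies-Bouldin index but does not give details. As a rough illustration only, the following sketch computes the standard DBI (lower values mean more compact, better separated clusters) and uses it to choose between candidate partitions. The function names `centroid`, `davies_bouldin`, and `pick_best_partition` and the toy data are hypothetical and are not taken from the paper; the weighted projection step of WPCC is not modeled here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean point of a non-empty cluster of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """Standard Davies-Bouldin index for a list of clusters.

    Assumes at least two non-empty clusters with distinct centroids
    (coincident centroids would cause a division by zero).
    """
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance from each point to its cluster centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k


def pick_best_partition(candidates):
    """Hypothetical helper: return the candidate partition with the lowest DBI."""
    return min(candidates, key=davies_bouldin)


# Toy data (assumed for illustration): two ways of grouping the same four points.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, well separated
mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # spread out, centroids close
best = pick_best_partition([mixed, tight])
```

In this toy case the `tight` grouping has the much lower index and is selected, which is the sense in which a DBI-guided procedure prefers high-quality clusters.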