Abstract: Imbalanced data classification remains a research hotspot and a challenging problem in machine learning. The difficulty of imbalanced learning lies not only in class imbalance but also in the more complex problem of class overlap. However, most existing algorithms focus mainly on the former, which limits further performance gains. To address this limitation, this paper proposes an ensemble algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS) that handles both class imbalance and class overlap. DCSHS has three main parts: a projection clustering combination framework (PCC), stage-wise hybrid sampling (SHS), and an envelope clustering transfer mapping mechanism (CTM). PCC creates multiple subsets through projective clustering. SHS identifies the overlapping region of each subset and applies hybrid sampling. CTM extracts more information from the samples in each subset by combining clustering with transfer learning. First, we design a PCC framework guided by the Davies-Bouldin index (DBI), a clustering validity measure, which is used to obtain high-quality clusters and combine them into a set of cross-complete subsets (CCS) with low overlap. Second, according to the class characteristics of each subset, an SHS algorithm is designed to de-overlap and balance the subsets. Finally, the CTM is constructed for all processed subsets by means of transfer learning, thereby reducing class overlap and exploring the structural information of the samples. Weak classifiers are trained on the balanced subsets and fused, as in other imbalanced ensemble algorithms.
The major advantage of our algorithm is that it exploits the intersections among the CCS to softly eliminate overlapping majority samples while learning as much information from the overlapping samples as possible, thereby mitigating class overlap while balancing the classes. In the experiments, more than 30 public datasets and more than ten representative algorithms are used for verification. The results show that DCSHS performs significantly better than the compared algorithms in terms of resistance to overlap, Recall, F1-M, G-M, AUC, and diversity.
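
The abstract does not state how the DBI guides the PCC framework, so the following is only a minimal sketch of the standard Davies-Bouldin index and of choosing, among several candidate clusterings, the one with the lowest value. The function names `davies_bouldin` and `pick_best_clustering` and the toy data are hypothetical illustrations, not part of DCSHS as published.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index; lower values indicate compact,
    well-separated clusters. Assumes no two centroids coincide."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")

    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance from each cluster's points to its centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity ratio to any other.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def pick_best_clustering(points, candidate_labelings):
    """Hypothetical selection rule: keep the labeling with the lowest DBI."""
    return min(candidate_labelings, key=lambda labs: davies_bouldin(points, labs))


# Toy data (an assumption for illustration): two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups
best = pick_best_clustering(points, [bad, good])
```

On this toy data the labeling that groups nearby points has the lower index, so it is the one selected. How DCSHS actually combines the selected clusters into cross-complete subsets is not described in the abstract and is not modeled here.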
