Abstract: Imbalanced data classification remains a research hotspot and a challenging problem in machine learning. The difficulty of imbalanced learning lies not only in class imbalance but also in the more complex problem of class overlapping. However, most existing algorithms focus mainly on the former, which limits how far they can improve. To address this limitation, this paper proposes an ensemble algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS) that handles both class imbalance and class overlapping. DCSHS has three main parts: a projection clustering combination framework (PCC), stage-wise hybrid sampling (SHS), and an envelope clustering transfer mapping mechanism (CTM). PCC creates multiple subsets through projective clustering. SHS identifies the overlapping region of each subset and applies hybrid sampling. CTM extracts more information from the samples in each subset by combining clustering with transfer learning. First, we design a PCC framework guided by the Davies-Bouldin index (DBI) of clustering validity, which is used to obtain high-quality clusters and combine them into a set of cross-complete subsets (CCS) with low overlapping. Second, an SHS algorithm is designed according to the class characteristics of each subset to remove overlapping and balance the subsets. Finally, a CTM is constructed for all processed subsets by means of transfer learning, thereby further reducing class overlapping and exploring the structural information of the samples. Weak classifiers are trained on the balanced subsets and then fused, as in other imbalanced ensemble algorithms.
The major advantage of our algorithm is that it exploits the intersections among the CCS to softly eliminate overlapping majority samples while learning as much information as possible from the overlapping samples, thereby mitigating class overlapping while balancing the classes. In the experimental section, more than 30 public datasets and over ten representative algorithms are used for verification. The experimental results show that DCSHS performs significantly better than the compared algorithms in terms of resistance to overlapping, Recall, F1-M, G-M, AUC, and diversity.
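
The abstract says that PCC is guided by the Davies-Bouldin index (DBI) but does not say how the index enters the procedure. As a rough illustration only, the sketch below computes the standard DBI for candidate partitions and keeps the one with the lowest value. The helper names (`centroid`, `davies_bouldin`, `pick_best_partition`) and the toy points are assumptions made for this example; they are not the paper's PCC implementation.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """Standard Davies-Bouldin index for a list of clusters.

    Each cluster is a list of points. Lower values mean more compact and
    better separated clusters. Assumes cluster centroids are distinct.
    """
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")
    centers = [centroid(c) for c in clusters]
    # S_i: mean distance from the points of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


def pick_best_partition(candidates):
    """Return the name of the candidate partition with the lowest DBI.

    `candidates` maps a label to a list of clusters (hypothetical helper).
    """
    return min(candidates, key=lambda name: davies_bouldin(candidates[name]))


# Toy data: two well separated groups, partitioned well and badly.
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
mixed = [[(0, 0), (10, 10), (0, 1)], [(1, 0), (10, 11), (11, 10)]]
best = pick_best_partition({"tight": tight, "mixed": mixed})
```

Here `best` is the label of the partition whose clusters are compact and far apart. A DBI-guided framework could use the same comparison to choose among clusterings produced in different projections, although the abstract does not specify that detail.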

An Imbalanced Data Classification Method Based on Automatic Clustering Under-Sampling

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning.

Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

A Density-based Under-sampling Algorithm for Imbalance Classification

An Imbalanced Ensemble Learning Method Based on Dual Clustering and Stage-Wise Hybrid Sampling

A hybrid ensemble and evolutionary algorithm for imbalanced classification and its application on bioinformatics

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

An Over Sampling Method of Unbalanced Data Based on Ant Colony Clustering

An ensemble imbalanced classification method based on model dynamic selection driven by data partition hybrid sampling

An effective method using clustering-based adaptive decomposition and editing-based diversified oversamping for multi-class imbalanced datasets

Hybrid SVM algorithm oriented to classifying imbalanced datasets

An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy

Adaptive Subspace Optimization Ensemble Method for High-Dimensional Imbalanced Data Classification

CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Self-paced Ensemble for Highly Imbalanced Massive Data Classification

Improved SVM algorithm for imbalanced dataset classification

An Improved MAHAKIL Oversampling Method for Imbalanced Dataset Classification

Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification

Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation