Abstract: Most existing imbalanced data classification models mainly focus on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from the difference and the smaller value between samples on each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, after which a new hierarchical clustering model is constructed. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented, which adds the generated samples to the imbalanced data and removes samples from the majority classes. Thus, a balanced decision system is constructed from the generated samples and the minority class samples. Second, to address the issues that the traditional symmetric uncertainty only considers the correlation between features and that mutual information ignores the information added after classification, the normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; the ordered sequence and the average of the symmetric uncertainty differences of each feature are then used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined in order to present the AWKNN density of features and their ordered sequence for clustering features.
Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, the maximum relevance between each feature and the decision classes and the minimum redundancy among features in the same cluster are presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select the optimal feature subset with high classification accuracy for imbalanced data.
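For orientation, the traditional symmetric uncertainty that the abstract takes as its starting point can be sketched as follows. This is only the classical definition, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), for discrete features. It is not the new measure based on normalized information gain, whose exact form the abstract does not give. The function names `entropy` and `symmetric_uncertainty` are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Classical SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), which lies in [0, 1].

    Assumption: x and y are equal-length sequences of discrete
    (already discretized) feature values.
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both variables are constant; returning 0 is a convention
        # chosen for this sketch, not something stated in the source.
        return 0.0
    hxy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy      # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)


if __name__ == "__main__":
    feature_a = [0, 0, 1, 1]
    feature_b = [0, 1, 0, 1]
    print(symmetric_uncertainty(feature_a, feature_a))  # identical features
    print(symmetric_uncertainty(feature_a, feature_b))  # independent features
```

A value near 1 indicates that two features are strongly dependent, and a value near 0 indicates that they share little information. This is the sense in which a symmetric-uncertainty-style measure can express both redundancy among features and relevance to the decision classes.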

Clustering-based incremental learning for imbalanced data classification

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning.

A Novel Imbalanced Data Classification Method Based on Weakly Supervised Learning for Fault Diagnosis

Dynamic Residual Classifier for Class Incremental Learning

An effective method using clustering-based adaptive decomposition and editing-based diversified oversamping for multi-class imbalanced datasets

Rethinking Class-Incremental Learning from a Dynamic Imbalanced Learning Perspective

Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

A cluster impurity-based hybrid resampling for imbalanced classification problems

An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy

Imbalanced Deep Learning by Minority Class Incremental Rectification

Addressing Imbalance for Class Incremental Learning in Medical Image Classification

A self-organizing incremental neural network for imbalance learning

Effective Decision Boundary Learning for Class Incremental Learning

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

Iterative Metric Learning for Imbalance Data Classification

Adaptive Sampling With Optimal Cost For Class-Imbalance Learning

Multi-Granularity Regularized Re-Balancing for Class Incremental Learning

Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

An Imbalanced Data Classification Method Based on Automatic Clustering Under-Sampling