Abstract: Most existing imbalanced data classification models focus mainly on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from their difference and the smaller of their values on each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, and a new hierarchical clustering model is constructed on this basis. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented, which adds the generated samples to the imbalanced data and removes samples from the majority classes. Thus, a balanced decision system is constructed from the generated samples and the minority class samples. Second, to address the issues that the traditional symmetric uncertainty considers only the correlation between features and that mutual information ignores the information added after classification, the normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; the ordered sequence and the average of the symmetric uncertainty differences of each feature are then used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined in order to present the AWKNN density of features and their ordered sequence for clustering features. Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, the maximum relevance between each feature and the decision classes and the minimum redundancy among features in the same cluster are presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select the optimal feature subset with high classification accuracy for imbalanced data.
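
For orientation, the traditional symmetric uncertainty that the abstract takes as its starting point can be sketched as below. This is only the classical definition for discrete features, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and not the new normalized-information-gain variant proposed in the paper, whose exact form the abstract does not give. The function names `entropy` and `symmetric_uncertainty` are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Classical symmetric uncertainty between two discrete variables.

    Assumption: this is the textbook measure, not the paper's modified one.
    Returns a value in [0, 1]; 0 means independent, 1 means fully dependent.
    """
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is no uncertainty to share.
        return 0.0
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)
```

Calling `symmetric_uncertainty` on a feature column and the class labels gives a relevance score, and calling it on two feature columns gives a redundancy score; the method described in the abstract builds its feature neighborhoods and its relevance and redundancy criteria on a modified version of this quantity.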

NearCount: Selecting critical instances based on the cited counts of nearest neighbors

Instance-Ranking: A New Perspective to Consider the Instance Dependency for Classification

Under-bagging Nearest Neighbors for Imbalanced Classification

Evidential instance selection for K-nearest neighbor classification of big data

Feature Selection for Unbalanced Distribution Hybrid Data Based on ${K}$-Nearest Neighborhood Rough Set

Cluster-oriented instance selection for classification problems

Novel resampling algorithms with maximal cliques for class-imbalance problems

A parameter-free hybrid instance selection algorithm based on local sets with natural neighbors

Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

An instance selection algorithm for fuzzy K-nearest neighbor

Nearest neighbors and density-based undersampling for imbalanced data classification with class overlap

Sample Weighting: an Inherent Approach for Outlier Suppressing Discriminant Analysis

Distributionally Robust Weighted $k$-Nearest Neighbors

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

A heuristic hybrid instance reduction approach based on adaptive relative distance and k-means clustering

An Adaptive Spectral Clustering Algorithm Based on the Importance of Shared Nearest Neighbors

A New Hashing based Nearest Neighbors Selection Technique for Big Datasets

Optimal Selection of Reference Set for the Nearest Neighbor Classification by Tabu Search

A Density-based Under-sampling Algorithm for Imbalance Classification

SMOTE-RkNN: A hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors