Abstract: Most existing imbalanced data classification models mainly focus on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from the difference and the smaller value between samples in each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, after which a new hierarchical clustering model is constructed. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented, which adds the generated samples to the imbalanced data and removes samples from the majority classes. Thus, a balanced decision system is constructed based on the generated samples and the minority class samples. Second, to address the issues that the traditional symmetric uncertainty only considers the correlation between features and that mutual information ignores the information added after classification, the normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; then, the ordered sequence and the average of the symmetric uncertainty differences of each feature are used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined to present the AWKNN density of features and their ordered sequence for clustering features. Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, the maximum relevance between each feature and the decision classes and the minimum redundancy among features in the same cluster are presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select the optimal feature subset with high classification accuracy for imbalanced data.
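For orientation, the traditional symmetric uncertainty that the abstract contrasts with its new measure is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The following is a minimal sketch of that traditional definition for discrete-valued features only; it is not the normalized-information-gain variant proposed in this study, and the function names and the zero-entropy convention are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Traditional SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)).

    x and y are equal-length sequences of discrete (hashable) values.
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Assumed convention: two constant features carry no information.
        return 0.0
    # Joint entropy H(X, Y) from the paired observations.
    hxy = entropy(list(zip(x, y)))
    # Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y).
    mutual_information = hx + hy - hxy
    return 2.0 * mutual_information / (hx + hy)


if __name__ == "__main__":
    f1 = [0, 0, 1, 1]
    f2 = [0, 0, 1, 1]  # identical to f1
    f3 = [0, 1, 0, 1]  # statistically independent of f1 in this sample
    print(symmetric_uncertainty(f1, f2))
    print(symmetric_uncertainty(f1, f3))
```

Under this definition, values near 1 indicate strongly dependent features and values near 0 indicate nearly independent ones; continuous features would need to be discretized first.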

Incremental Reduction of Imbalanced Distributed Mixed Data Based on K-Nearest Neighbor Rough Set

Feature Selection for Unbalanced Distribution Hybrid Data Based on K-Nearest Neighborhood Rough Set

Incremental neighborhood entropy-based feature selection for mixed-type data under the variation of feature set

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

Feature Selection Method on Imbalanced Text

Incremental feature selection approach to multi-dimensional variation based on matrix dominance conditional entropy for ordered data set

Incremental reduction methods based on granular ball neighborhood rough sets and attribute grouping

Matrix-Based Incremental Feature Selection Method Using Weight-Partitioned Multigranulation Rough Set

Incremental feature selection based on fuzzy rough sets

Bi-directional Adaptive Neighborhood Rough Sets Based Attribute Subset Selection

A composite entropy-based uncertainty measure guided attribute reduction for imbalanced mixed-type data

Adaptive Fuzzy Multi-Neighborhood Feature Selection with Hybrid Sampling and Its Application for Class-Imbalanced Data

Feature selection algorithm using neighborhood equivalence tolerance relation for incomplete decision systems

Attribute Reduction with Personalized Information Granularity of Nearest Mutual Neighbors

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

Feature selection considering feature relevance, redundancy and interactivity for neighborhood decision systems

A heuristic hybrid instance reduction approach based on adaptive relative distance and k-means clustering

An Emerging Fuzzy Feature Selection Method Using Composite Entropy-Based Uncertainty Measure and Data Distribution

Discernible Neighborhood Counting Based Incremental Feature Selection for Heterogeneous Data

Mixed Measure-Based Feature Selection Using the Fisher Score and Neighborhood Rough Sets