Abstract: Most existing imbalanced data classification models focus mainly on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from their difference and their smaller value on each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, and a new hierarchical clustering model is constructed from it. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented, which adds the generated samples to the imbalanced data and removes samples from the majority classes. Thus, a balanced decision system is constructed from the generated samples and the minority class samples. Second, to address the issues that traditional symmetric uncertainty considers only the correlation between features and that mutual information ignores the information added after classification, the normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; the ordered sequence and the average of the symmetric uncertainty differences of each feature are then used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined to present the AWKNN density of features and their ordered sequence for clustering features. Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, the maximum relevance between each feature and the decision classes and the minimum redundancy among features in the same cluster are presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select the optimal feature subset with high classification accuracy for imbalanced data.
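For orientation, the traditional symmetric uncertainty that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard textbook definition, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), for discrete features. It is not the new measure based on normalized information gain proposed in the study, whose exact form the abstract does not give. The function names `entropy`, `mutual_information`, and `symmetric_uncertainty` are chosen here for illustration only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two equal-length discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetric_uncertainty(x, y):
    """Standard symmetric uncertainty: 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both variables are constant; treat them as carrying no shared information.
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)
```

A value near 1 indicates that one variable almost determines the other, and a value near 0 indicates near independence. Continuous features would need to be discretized first, which is an assumption of this sketch and not a step described in the abstract.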

Feature Selection for Unbalanced Distribution Hybrid Data Based on $K$-Nearest Neighborhood Rough Set

Unsupervised Feature Selection with Ordinal Locality

An Unsupervised Feature Selection Method Based on Improved ReliefF and Bisecting K-means

An Emerging Fuzzy Feature Selection Method Using Composite Entropy-Based Uncertainty Measure and Data Distribution

U^2F^2S^2: Uncovering Feature-level Similarities for Unsupervised Feature Selection

Feature Selection Method on Imbalanced Text

Incremental neighborhood entropy-based feature selection for mixed-type data under the variation of feature set

Feature selection of dominance-based neighborhood rough set approach for processing hybrid ordered data

Adaptive Fuzzy Multi-Neighborhood Feature Selection with Hybrid Sampling and Its Application for Class-Imbalanced Data

Feature selection considering feature relevance, redundancy and interactivity for neighborhood decision systems

Multi-label feature selection based on fuzzy neighborhood rough sets

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

Bi-directional Adaptive Neighborhood Rough Sets Based Attribute Subset Selection

Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy

Feature selection algorithm using neighborhood equivalence tolerance relation for incomplete decision systems

Mixed Measure-Based Feature Selection Using the Fisher Score and Neighborhood Rough Sets

Feature selection for label distribution learning using dual-similarity based neighborhood fuzzy entropy

Accelerating information entropy-based feature selection using rough set theory with classified nested equivalence classes

Feature selection based on multiview entropy measures in multiperspective rough set

Feature subset selection for multi-scale neighborhood decision information system via mutual information