Abstract: Identifying suitable feature subsets in High-Dimensional Low-Sample-Size (HDLSS) data is of paramount importance because such data often contain numerous redundant and irrelevant features, which lead to poor classification performance. However, selecting an optimal feature subset from a vast feature space poses a significant computational challenge. On HDLSS data, conventional feature selection methods often struggle to balance reducing the number of features against preserving high classification accuracy. To address these issues, the study introduces a framework that combines filter- and wrapper-based strategies and is designed specifically for the classification challenges inherent in HDLSS data. The framework adopts a multi-step approach. In the first stage, ensemble feature selection integrates five filter ranking approaches, namely Chi-square (χ²), Gini index (GI), F-score, Mutual Information (MI), and Symmetric Uncertainty (SU), to identify the top-ranking features. In the second stage, a wrapper-based search is applied, with the Differential Evolution (DE) metaheuristic algorithm as the search strategy. The fitness of each candidate feature subset during this search is a weighted combination of the error rate of a Support Vector Machine (SVM) classifier and the ratio of feature cardinality. The dimensionality-reduced datasets are then used to build classification models with SVM, K-Nearest Neighbors (KNN), and Logistic Regression (LR). The approach was evaluated on 13 HDLSS datasets to assess its efficacy in selecting appropriate feature subsets and in improving Classification Accuracy (ACC) along with Area Under the Curve (AUC).
Results show that the proposed ensemble with wrapper-based approach selects a small number of features (between 2 and 9 across all datasets) while maintaining high average AUC and ACC (between 98% and 100%). The comparative analysis shows that the proposed method surpasses both ensemble feature selection alone and classification without feature selection in terms of feature reduction and ACC. Compared with various other state-of-the-art methods, the approach also performs competitively.
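
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the five filter rankings are combined or what weight the fitness uses, so the mean-rank aggregation in `aggregate_ranks`, the weight `alpha`, and its default value are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def aggregate_ranks(scores_by_filter: Dict[str, Sequence[float]], top_k: int) -> List[int]:
    """Return the indices of the top_k features by mean rank across filters.

    Assumptions: a higher score means a more relevant feature, and the
    filters are combined by mean rank. The abstract does not specify
    the aggregation rule.
    """
    n_filters = len(scores_by_filter)
    n_features = len(next(iter(scores_by_filter.values())))
    mean_rank = [0.0] * n_features
    for scores in scores_by_filter.values():
        # Rank 0 goes to the highest-scoring feature under this filter.
        order = sorted(range(n_features), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            mean_rank[j] += rank / n_filters
    return sorted(range(n_features), key=lambda j: mean_rank[j])[:top_k]


def fitness(error_rate: float, n_selected: int, n_total: int, alpha: float = 0.9) -> float:
    """Weighted sum of classifier error and feature ratio (lower is better).

    alpha is a hypothetical weight; the abstract does not give its value.
    """
    if n_selected == 0:
        return float("inf")  # an empty subset cannot be classified
    return alpha * error_rate + (1.0 - alpha) * n_selected / n_total
```

In a full pipeline, `error_rate` would come from a cross-validated SVM trained on the candidate subset, and a Differential Evolution search would minimize `fitness` over binary feature masks drawn from the features that `aggregate_ranks` retains.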

A Weighted K-Center Algorithm for Data Subset Selection

Sample Weighting: an Inherent Approach for Outlier Suppressing Discriminant Analysis

Embrace Sustainable AI: Dynamic Data Subset Selection for Image Classification

Efficient Algorithms for the One-Dimensional K-Center Problem

Finding High-Value Training Data Subset through Differentiable Convex Programming

Scalable and space-efficient Robust Matroid Center algorithms

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming.

Distance Weighted K-Means Algorithm for Center Selection in Training Radial Basis Function Networks

Less is more: Selecting informative and diverse subsets with balancing constraints

Performance analysis of Kmeans with modified initial centroid selection algorithms and developed Kmeans9+ model

Optimal Data Selection: An Online Distributed View

Feature selection based on weight updating and K-L distance

Matroid and Knapsack Center Problems

A Dataset Representativeness Metric and A Slicing Sampling Strategy for the Kennard-Stone Algorithm

Feature Subset Selection for High-Dimensional, Low Sampling Size Data Classification Using Ensemble Feature Selection With a Wrapper-Based Search

Communication-efficient k-Means for Edge-based Machine Learning

A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems

KNCFS: Feature selection for high-dimensional datasets based on improved random multi-subspace learning