Abstract: Feature selection from a large number of covariates (aka features) in a regression analysis remains a challenge in data science, especially in terms of its potential of scaling to ever-enlarging data and finding a group of scientifically meaningful features. For example, to develop new, responsive drug targets for ovarian cancer, the actual false discovery rate (FDR) of a practical feature selection procedure must also match the target FDR. The popular approach to feature selection, when true features are sparse, is to use a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure (call them benchmark procedures). We present a different approach using a new subsampling method, called the Subsampling Winner algorithm (SWA). The central idea of SWA is analogous to that used for the selection of US national merit scholars. SWA uses a "base procedure" to analyze each of the subsamples, computes the scores of all features according to the performance of each feature from all subsample analyses, obtains the "semifinalists" based on the resulting scores, and then determines the "finalists," i.e., the most important features. Due to its subsampling nature, SWA can scale to data of any dimension in principle. The SWA also has the best-controlled actual FDR in comparison with the benchmark procedures and the randomForest, while having a competitive true-feature discovery rate. We also suggest practical add-on strategies to SWA with or without a penalized benchmark procedure to further assure the chance of "true" discovery. Our application of SWA to the ovarian serous cystadenocarcinoma specimens from the Broad Institute revealed functionally important genes and pathways, which we verified by additional genomics tools. This second-stage investigation is essential in the current discussion of the proper use of P-values.
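
To make the two quantities compared in the abstract concrete, here is a minimal sketch of how they might be computed in a simulation where the true features are known. The function name `discovery_rates` and the definition of the true-feature discovery rate as the fraction of true features recovered are assumptions for illustration, not taken from the paper. For a single run the first value is a false discovery proportion; averaging it over many simulated data sets would estimate the actual FDR.

```python
def discovery_rates(selected, true_features):
    """Return (false discovery proportion, true-feature discovery rate)
    for one selection result. Hypothetical helper, not from the paper."""
    selected, true_features = set(selected), set(true_features)
    true_hits = len(selected & true_features)
    false_hits = len(selected) - true_hits
    # Convention (an assumption): an empty selection has no false discoveries.
    fdp = false_hits / len(selected) if selected else 0.0
    tdr = true_hits / len(true_features) if true_features else 0.0
    return fdp, tdr


# Four features selected, three of them among the five true features.
fdp, tdr = discovery_rates(selected=[0, 1, 2, 7], true_features=[0, 1, 2, 3, 4])
print(fdp, tdr)  # 0.25 0.6
```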

What problem does this paper attempt to address?

The paper addresses feature selection in large-scale regression data: how to find a set of scientifically meaningful features as the data keep growing. In ovarian cancer research specifically, the actual false discovery rate (FDR) must match the target FDR in order to develop new, responsive drug targets. Existing feature selection methods have limitations on high-dimensional data, especially when the true features are sparse, and may fail to identify the important features. The paper therefore proposes a new subsampling-based method, the Subsampling Winner Algorithm (SWA).

### Main problems:

1. **Feature selection in high-dimensional data**: How can important features be selected efficiently from a very large number of candidates?
2. **Controlling the false discovery rate**: How can the actual FDR be controlled while keeping a high true-feature discovery rate, especially when the true features are sparse?
3. **Scalability**: How can the method be designed to handle data of any dimension and remain computationally feasible?

### Solutions:

1. **Subsampling Winner Algorithm (SWA)**:
   - **Subsampling**: Randomly draw subsamples of the features and analyze each subsample with a "base procedure".
   - **Scoring**: Score every feature according to its performance across all subsample analyses.
   - **Semifinalists**: Select the features with the highest scores as "semifinalists".
   - **Finalists**: Analyze the semifinalists further to determine the "finalists", i.e., the most important features.
2. **Performance advantages**:
   - **FDR control**: SWA has the best-controlled actual FDR in comparison with the benchmark procedures (LASSO, SCAD, Elastic Net, MCP) and randomForest.
   - **True discovery rate**: SWA has a competitive true-feature discovery rate (TDR).
   - **Scalability**: Because it works on subsamples, SWA can in principle scale to data of any dimension.

### Application examples:

- **Ovarian cancer data**: SWA was applied to ovarian serous cystadenocarcinoma specimens from the Broad Institute and revealed functionally important genes and pathways, which were verified with additional genomics tools.

### Conclusion:

SWA performs feature selection in large-scale regression data while maintaining good FDR control and a competitive true-feature discovery rate. The method is relevant to biomedical research, especially the development of new drug targets.

Subsampling Winner Algorithm for Feature Selection in Large Regression Data
