Abstract: Feature selection from a large number of covariates (aka features) in a regression analysis remains a challenge in data science, particularly with respect to scaling to ever-larger data and identifying a group of scientifically meaningful features. For example, to develop new, responsive drug targets for ovarian cancer, the actual false discovery rate (FDR) of a practical feature selection procedure must also match the target FDR. When the true features are sparse, the popular approach to feature selection is to use a penalized likelihood or shrinkage estimation procedure, such as LASSO, SCAD, Elastic Net, or MCP (hereafter, the benchmark procedures). We present a different approach based on a new subsampling method, called the Subsampling Winner algorithm (SWA). The central idea of SWA is analogous to that used for the selection of US national merit scholars. SWA uses a "base procedure" to analyze each of the subsamples, computes a score for every feature according to its performance across all subsample analyses, obtains the "semifinalists" based on the resulting scores, and then determines the "finalists," i.e., the most important features. Because it works on subsamples, SWA can in principle scale to data of any dimension. SWA also has the best-controlled actual FDR in comparison with the benchmark procedures and randomForest, while having a competitive true-feature discovery rate. We also suggest practical add-on strategies to SWA, with or without a penalized benchmark procedure, to further improve the chance of "true" discovery. Our application of SWA to the ovarian serous cystadenocarcinoma specimens from the Broad Institute revealed functionally important genes and pathways, which we verified with additional genomics tools. Such second-stage investigation is essential in light of the current discussion of the proper use of P-values.
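The subsample, score, semifinalist, finalist sequence described in the abstract can be illustrated with a rough sketch. The abstract does not specify the base procedure, the scoring rule, or the final selection rule, so everything below is an assumption made for illustration: ordinary least squares as the base procedure, the mean absolute t-statistic as the score, and a fixed t-statistic cutoff for finalists. The names `ols_abs_t`, `subsampling_winner_sketch`, `s`, `m`, `q`, and `t_cut` are hypothetical, and the sketch does not reproduce the FDR control reported for SWA.

```python
import numpy as np


def ols_abs_t(X, y):
    """Absolute t-statistics of the OLS slopes (an intercept is fitted but not returned)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)          # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return np.abs(beta[1:] / se[1:])


def subsampling_winner_sketch(X, y, s=10, m=500, q=10, t_cut=3.0, seed=0):
    """Illustrative subsample -> score -> semifinalists -> finalists pipeline.

    s: features per subsample, m: number of subsamples, q: number of
    semifinalists, t_cut: finalist cutoff. All are assumed tuning knobs,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    score_sum = np.zeros(p)
    counts = np.zeros(p)

    # Analyze each random feature subsample with the (assumed) base procedure.
    for _ in range(m):
        idx = rng.choice(p, size=s, replace=False)
        score_sum[idx] += ols_abs_t(X[:, idx], y)
        counts[idx] += 1

    # Score each feature by its average performance over the subsamples it joined.
    scores = score_sum / np.maximum(counts, 1)

    # Semifinalists: the q highest-scoring features.
    semifinalists = np.argsort(scores)[::-1][:q]

    # Finalists: semifinalists that stay significant when fitted jointly.
    t_final = ols_abs_t(X[:, semifinalists], y)
    finalists = np.sort(semifinalists[t_final > t_cut])
    return scores, semifinalists, finalists


# Synthetic demo: 50 features, of which only the first three affect y.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 50))
y_demo = 3.0 * X_demo[:, :3].sum(axis=1) + rng.standard_normal(200)
scores, semifinalists, finalists = subsampling_winner_sketch(X_demo, y_demo)
print("finalists:", finalists)
```

Each subsample fit involves only `s` features, so the cost of a single fit does not grow with the total number of features; this is the sense in which a subsampling scheme of this kind can scale in principle.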

Sub-Setting Algorithm for Training Data Selection in Pattern Recognition

A model-free subdata selection method for classification

A Feature Selection Method Based on Feature Grouping and Genetic Algorithm

Change is Hard: A Closer Look at Subpopulation Shift

Embrace Sustainable AI: Dynamic Data Subset Selection for Image Classification

Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks

Bi-directional Adaptive Neighborhood Rough Sets Based Attribute Subset Selection

Finding High-Value Training Data Subset through Differentiable Convex Programming

A sub-sampling algorithm preventing outliers

Feature selection based on weight updating and K-L distance

Optimal Data Selection: An Online Distributed View

A Time-Sensitive Hybrid Learning Model for Patient Subgrouping

Self-tuned Visual Subclass Learning with Shared Samples: An Incremental Approach

SISE-PC: Semi-supervised Image Subsampling for Explainable Pathology

Subsampling Winner Algorithm for Feature Selection in Large Regression Data

A Weighted K-Center Algorithm for Data Subset Selection

Enhancing Neural Subset Selection: Integrating Background Information into Set Representations

GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training

Multistrategy Learning Using Genetic Algorithms and Neural Networks for Pattern Classification

A penalized variable selection ensemble algorithm for high-dimensional group-structured data

Acute coronary syndrome risk prediction based on gradient boosted tree feature selection and recursive feature elimination: A dataset-specific modeling study