Subsampling Winner Algorithm for Feature Selection in Large Regression Data

Yiying Fan, Jiayang Sun
DOI: https://doi.org/10.48550/arXiv.2002.02903
2020-02-08
Abstract: Feature selection from a large number of covariates (aka features) in a regression analysis remains a challenge in data science, especially in terms of its potential of scaling to ever-enlarging data and finding a group of scientifically meaningful features. For example, to develop new, responsive drug targets for ovarian cancer, the actual false discovery rate (FDR) of a practical feature selection procedure must also match the target FDR. The popular approach to feature selection, when true features are sparse, is to use a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure (call them benchmark procedures). We present a different approach using a new subsampling method, called the Subsampling Winner algorithm (SWA). The central idea of SWA is analogous to that used for the selection of US national merit scholars. SWA uses a "base procedure" to analyze each of the subsamples, computes the scores of all features according to the performance of each feature from all subsample analyses, obtains the "semifinalist" based on the resulting scores, and then determines the "finalists," i.e., the most important features. Due to its subsampling nature, SWA can scale to data of any dimension in principle. The SWA also has the best-controlled actual FDR in comparison with the benchmark procedures and the randomForest, while having a competitive true-feature discovery rate. We also suggest practical add-on strategies to SWA with or without a penalized benchmark procedure to further assure the chance of "true" discovery. Our application of SWA to the ovarian serous cystadenocarcinoma specimens from the Broad Institute revealed functionally important genes and pathways, which we verified by additional genomics tools. This second-stage investigation is essential in the current discussion of the proper use of P-values.
Machine Learning
What problem does this paper attempt to address?
The paper addresses feature selection in large-scale regression data: finding a set of scientifically meaningful features with a procedure that still scales as datasets keep growing. Its motivating example is ovarian cancer research, where the actual false discovery rate (FDR) of a selection procedure must match the target FDR if the selected features are to lead to new, responsive drug targets. When true features are sparse, the usual approach is penalized likelihood or shrinkage estimation (LASSO, SCAD, Elastic Net, MCP). The paper proposes a different, subsampling-based method, the Subsampling Winner Algorithm (SWA).

### Main problems:

1. **Feature selection in high-dimensional data**: How can the important features be selected efficiently from a very large number of candidates?
2. **Controlling the false discovery rate**: How can the actual FDR be kept at the target level while the true-feature discovery rate stays high, especially when true features are sparse?
3. **Scalability**: How can the method be designed so that it handles data of any dimension and remains computationally feasible?

### Solutions:

1. **Subsampling Winner Algorithm (SWA)**:
   - **Subsampling**: Randomly draw subsamples from all features and analyze each subsample with a "base procedure".
   - **Scoring**: Score every feature according to its performance across all subsample analyses.
   - **Semifinalists**: Select the features with the highest scores as "semifinalists".
   - **Finalists**: Analyze the semifinalists further to determine the "finalists", i.e., the most important features.
2. **Performance advantages**:
   - **FDR control**: SWA has the best-controlled actual FDR in comparison with the benchmark procedures (LASSO, SCAD, Elastic Net, MCP) and randomForest.
   - **True-feature discovery rate**: SWA has a competitive true-feature discovery rate (TDR).
   - **Scalability**: Because it works on subsamples, SWA can in principle scale to data of any dimension.

### Application examples:

- **Ovarian cancer data**: SWA was applied to ovarian serous cystadenocarcinoma specimens from the Broad Institute. It revealed functionally important genes and pathways, which the authors verified with additional genomics tools.

### Conclusion:

SWA performs feature selection in large-scale regression data while keeping the actual FDR well controlled and the true-feature discovery rate competitive. The authors also suggest add-on strategies, with or without a penalized benchmark procedure, to further assure the chance of "true" discovery. The method is relevant to biomedical research, in particular to the search for new drug targets.
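
The SWA steps summarized above (subsampling, scoring, semifinalists, finalists) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. The base procedure (ordinary least squares), the scoring rule (mean absolute t-statistic over the subsamples that contain a feature), the final cutoff, and all names such as `swa_sketch`, `s`, `m`, `q`, and `t_cut` are assumptions made for this example; the paper defines its own scoring and selection rules. The simulated data are likewise hypothetical.

```python
import numpy as np


def ols_abs_t(X, y):
    """Absolute t-statistics of the slopes from an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)          # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(A.T @ A)))
    return np.abs(beta[1:] / se[1:])              # drop the intercept


def swa_sketch(X, y, s=10, m=500, q=15, t_cut=3.0, seed=0):
    """Assumed simplification of SWA: returns (finalists, scores)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    total = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(m):
        # Subsampling: draw s of the p features without replacement.
        idx = rng.choice(p, size=s, replace=False)
        # Base procedure (assumed): OLS on the subsample of features.
        total[idx] += ols_abs_t(X[:, idx], y)
        counts[idx] += 1
    # Scoring (assumed): average |t| over the subsamples containing the feature.
    scores = total / np.maximum(counts, 1)
    # Semifinalists: the q features with the highest scores.
    semifinalists = np.argsort(scores)[::-1][:q]
    # Finalists (assumed rule): refit on semifinalists, keep those with |t| > t_cut.
    t_final = ols_abs_t(X[:, semifinalists], y)
    finalists = np.sort(semifinalists[t_final > t_cut])
    return finalists, scores


def fdp_and_tdp(selected, truth):
    """False discovery proportion and true discovery proportion of a selection."""
    selected = {int(j) for j in selected}
    truth = {int(j) for j in truth}
    fdp = len(selected - truth) / max(len(selected), 1)
    tdp = len(selected & truth) / max(len(truth), 1)
    return fdp, tdp


# Hypothetical simulated data: 200 observations, 100 features, 3 true features.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 100))
true_features = [0, 1, 2]
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 3.0 * X[:, 2] + rng.standard_normal(200)

finalists, scores = swa_sketch(X, y)
fdp, tdp = fdp_and_tdp(finalists, true_features)
print("finalists:", finalists.tolist(), "FDP:", fdp, "TDP:", tdp)
```

Averaging `fdp` over many simulated datasets gives an estimate of the actual FDR, which is the quantity the paper compares against the target FDR. Because each fit involves only `s` features, the cost per subsample does not depend on the total number of features, which is the sense in which a subsampling scheme of this kind can scale.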