Abstract: Researchers often perform data-driven variable selection when modeling the associations between an outcome and multiple independent variables in regression analysis. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model. Yet variable selection can also have negative consequences, such as false exclusion of important variables or inclusion of noise variables, biased estimation of regression coefficients, underestimated standard errors and invalid confidence intervals, as well as model instability. While the potential advantages and disadvantages of variable selection have been discussed in the literature for decades, few large-scale simulation studies have neutrally compared data-driven variable selection methods with respect to their consequences for the resulting models. We present the protocol for a simulation study that will evaluate different variable selection methods: forward selection, stepwise forward selection, backward elimination, augmented backward elimination, univariable selection, univariable selection followed by backward elimination, and penalized likelihood approaches (Lasso, relaxed Lasso, adaptive Lasso). These methods will be compared with respect to false inclusion and/or exclusion of variables, effects on the bias and variance of the estimated regression coefficients, the validity of the confidence intervals for the coefficients, the accuracy of the estimated variable importance ranking, and the predictive performance of the selected models. We consider both linear and logistic regression in a low-dimensional setting (20 independent variables with 10 true predictors and 10 noise variables). The simulation will be based on real-world data from the National Health and Nutrition Examination Survey (NHANES). Publishing this study protocol ahead of performing the simulation increases transparency and allows the perspectives of other experts to be integrated into the study design.
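As a rough illustration of the kind of comparison described above, the following minimal sketch simulates a linear model with 10 true predictors and 10 noise variables, applies backward elimination, and records false inclusion and false exclusion rates. It is not the protocol's implementation: the independent Gaussian covariates (in place of NHANES-based data), the common effect size `beta`, the sample size, the AIC stopping rule, and all function names are assumptions made only for this example.

```python
import numpy as np


def simulate(n, rng, n_true=10, n_noise=10, beta=0.5, sigma=1.0):
    """Draw one synthetic data set (assumed design, not NHANES-based)."""
    p = n_true + n_noise
    X = rng.standard_normal((n, p))
    coef = np.concatenate([np.full(n_true, beta), np.zeros(n_noise)])
    y = X @ coef + sigma * rng.standard_normal(n)
    return X, y, coef != 0


def aic(X, y, cols):
    """Gaussian AIC (up to a constant) of an OLS fit with intercept on `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def backward_elimination(X, y):
    """Drop one variable at a time while doing so lowers the AIC (assumed rule)."""
    selected = list(range(X.shape[1]))
    current = aic(X, y, selected)
    while selected:
        candidates = [
            (aic(X, y, [k for k in selected if k != j]), j) for j in selected
        ]
        best, drop = min(candidates)
        if best >= current:  # no single removal improves the criterion
            break
        selected.remove(drop)
        current = best
    return selected


def run(n_rep=50, n=200, seed=0):
    """Average false inclusion and false exclusion rates over replications."""
    rng = np.random.default_rng(seed)
    false_incl, false_excl = [], []
    for _ in range(n_rep):
        X, y, is_true = simulate(n, rng)
        chosen = np.zeros(X.shape[1], dtype=bool)
        chosen[np.array(backward_elimination(X, y), dtype=int)] = True
        # Share of noise variables that were selected.
        false_incl.append((chosen & ~is_true).sum() / (~is_true).sum())
        # Share of true predictors that were dropped.
        false_excl.append((~chosen & is_true).sum() / is_true.sum())
    return float(np.mean(false_incl)), float(np.mean(false_excl))


if __name__ == "__main__":
    fi, fe = run()
    print(f"false inclusion rate: {fi:.3f}, false exclusion rate: {fe:.3f}")
```

The same loop could wrap any of the other selection methods listed in the abstract by swapping out `backward_elimination`; the other performance measures (coefficient bias, confidence interval coverage, predictive accuracy) would need additional bookkeeping that is omitted here.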

Comparing Stochastic Optimization Methods for Variable Selection in Binary Outcome Prediction, With Application to Health Policy

A Case Study of Stochastic Optimization in Health Policy: Problem Formulation and Preliminary Results

Variable Selection in Logistic Regression Model with Genetic Algorithm

Bayesian variable selection using cost-adjusted BIC, with application to cost-effective measurement of quality of health care

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

Lookahead and Piloting Strategies for Variable Selection

Bayesian variable selection in linear regression models with instrumental variables

Variable selection for competing risk regression models: recommendations for analyzing data from epidemiological studies

Nonparametric Assessment of Variable Selection and Ranking Algorithms

Sequential Advantage Selection for Optimal Treatment Regimes

Comparison of methods for early-readmission prediction in a high-dimensional heterogeneous covariates and time-to-event outcome framework

Decision Curve Analysis: a Technical Note

Evaluating variable selection methods for multivariable regression models: A simulation study protocol

Scalable Bayesian bi-level variable selection in generalized linear models

Development and Application of a Genetic Algorithm for Variable Optimization and Predictive Modeling of Five-Year Mortality Using Questionnaire Data

Bayesian Variable Selection with Related Predictors

A hybrid deterministic–deterministic approach for high-dimensional Bayesian variable selection with a default prior

A comparison of methods for model selection when estimating individual treatment effects

Scalable Approximations of Marginal Posteriors in Variable Selection

Bayesian outcome selection modelling

Identification of the Optimal Treatment Regimen in the Presence of Missing Covariates