Comparing Stochastic Optimization Methods for Variable Selection in Binary Outcome Prediction, With Application to Health Policy

Dimitris Fouskakis, David Draper
DOI: https://doi.org/10.1198/016214508000001048
IF: 4.369
2008-12-01
Journal of the American Statistical Association
Abstract: Traditional variable-selection strategies in generalized linear models (GLMs) seek to optimize a measure of predictive accuracy without regard for the cost of data collection. When the purpose of such model building is the creation of predictive scales to be used in future studies with constrained budgets, the standard approach may not be optimal. We propose a Bayesian decision-theoretic framework for variable selection in binary-outcome GLMs where the budget for data collection is constrained and potential predictors may vary considerably in cost. The method is illustrated using data from a large study of quality of hospital care in the U.S. in the 1980s. Especially when the number of available predictors p is large, it is important to use an appropriate technique for optimization (e.g., in an application presented here where p = 83, the space over which we search has 2^83 ≐ 10^25 elements, which is too large to explore using brute force enumeration). Specifically, we investigate simulated annealing (SA), genetic algorithms (GAs), and the tabu search (TS) method used in operations research, and we develop a context-specific version of SA, improved simulated annealing (ISA), that performs better than its generic counterpart. When p was modest in our study, we found that GAs performed relatively poorly for all but the very best user-defined input configurations, generic SA did not perform well, and TS had excellent median performance and was much less sensitive to suboptimal choice of user-defined inputs. When p was large in our study, the best versions of GA and ISA outperformed TS and generic SA. Our results are presented in the context of health policy but can apply to other quality assessment settings with dichotomous outcomes as well.
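To make the search problem in the abstract concrete, the following is a minimal Python sketch of generic simulated annealing over variable-inclusion vectors. It is not the authors' ISA or their utility function: the names `simulated_annealing` and `toy_utility`, the single-bit-flip neighborhood, the geometric cooling schedule, and the additive benefit-minus-cost utility are all illustrative assumptions, chosen only so that the optimum is easy to check.

```python
import math
import random

# Size of the model space for p = 83 candidate predictors (each in or out).
n_models = 2 ** 83

def simulated_annealing(utility, p, n_iter=5000, t_start=1.0, t_end=0.01, seed=0):
    """Generic SA maximizing `utility` over boolean inclusion vectors of length p."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(p)]
    current_u = utility(current)
    best, best_u = current[:], current_u
    for k in range(n_iter):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (k / max(1, n_iter - 1))
        # Neighbor: flip the inclusion status of one randomly chosen variable.
        candidate = current[:]
        j = rng.randrange(p)
        candidate[j] = not candidate[j]
        cand_u = utility(candidate)
        delta = cand_u - current_u
        # Always accept improvements; accept worse moves with prob exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_u = candidate, cand_u
            if current_u > best_u:
                best, best_u = current[:], current_u
    return best, best_u

# Hypothetical toy problem: each variable has a predictive benefit and a
# data-collection cost; utility is total benefit minus total cost.
benefits = [3.0, 1.0, 4.0, 1.5, 5.0, 0.9, 2.6, 0.5, 3.5, 0.8]
costs = [1.0, 2.0, 1.0, 2.5, 2.0, 1.0, 3.0, 0.4, 1.0, 2.0]

def toy_utility(include):
    return sum(b - c for b, c, inc in zip(benefits, costs, include) if inc)

best, best_u = simulated_annealing(toy_utility, p=len(benefits))
# For this additive toy utility the true optimum is known in closed form.
optimum = sum(b - c for b, c in zip(benefits, costs) if b > c)
print(n_models, best, best_u, optimum)
```

Because the toy utility is additive, the best subset is simply every variable whose benefit exceeds its cost, which gives a reference value to compare the search against. In a realistic setting the utility would depend on fitting a model for each subset, and that dependence is what makes exhaustive enumeration over 2^p subsets infeasible.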
statistics & probability