Practical considerations for variable screening in the Super Learner

Brian D. Williamson, Drew King, Ying Huang

2023-11-07

Abstract: Estimating a prediction function is a fundamental component of many data analyses. The Super Learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms, including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a Super Learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any one screen, similar to the guidance for choosing a library of prediction algorithms for the Super Learner.

Machine Learning

What problem does this paper attempt to address?

The paper examines the use of variable screening algorithms within the Super Learner (an ensemble learning method), focusing on how the lasso performs as a screen and how that choice affects the predictive performance of the ensemble. It addresses three questions:

1. **Performance of the lasso within the Super Learner**: The lasso is known to perform poorly in certain settings. The researchers ask whether using it as a screen degrades the overall performance of the Super Learner in those settings.
2. **Effectiveness of different screening algorithms**: The study compares several screens (the lasso, rank-correlation screening, univariate correlation screening, and random forests) to determine which screening strategies improve predictive accuracy.
3. **A diverse library of screening algorithms**: Because any single screen may perform poorly in a specific scenario, the authors suggest building the Super Learner with multiple screening algorithms, which protects the ensemble from the adverse effects of any one screen.

The paper investigates these questions through a series of numerical experiments that vary the outcome-feature relationship (linear or nonlinear), the correlation among features (correlated or uncorrelated), the number of features (low-dimensional to high-dimensional), and the sample size.

The results indicate that using the lasso as the only screen leads to poor predictive performance when the relationship is nonlinear. If the Super Learner instead includes a rich set of candidate screens, including the lasso among them does not significantly degrade performance. Researchers should therefore consider a diverse library of screening algorithms when constructing a Super Learner.
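The idea of pairing every screen with every prediction algorithm can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: it uses a cross-validated selector (the "discrete" Super Learner) rather than a weighted ensemble, it omits the lasso and random forest screens for brevity, and every function name, the simulated data, and the tuning values (`k`, `folds`) are assumptions made for the example.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0

def ranks(v):
    """Ranks 0..n-1 (ties broken by position; adequate for continuous data)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def column(X, j):
    return [row[j] for row in X]

# Screens: each maps training data to a list of retained column indices.
def screen_all(X, y):
    return list(range(len(X[0])))

def screen_pearson(X, y, k=3):
    p = len(X[0])
    score = [abs(pearson(column(X, j), y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])[:k]

def screen_spearman(X, y, k=3):
    # Rank correlation = Pearson correlation computed on the ranks.
    p, ry = len(X[0]), ranks(y)
    score = [abs(pearson(ranks(column(X, j)), ry)) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])[:k]

# Learners: each returns a prediction function fitted on training data.
def learn_mean(X, y):
    mu = statistics.fmean(y)
    return lambda row: mu

def learn_knn(X, y, k=5):
    def predict(row):
        dist = sorted(
            (sum((a - b) ** 2 for a, b in zip(r, row)), yi) for r, yi in zip(X, y)
        )
        return statistics.fmean([yi for _, yi in dist[:k]])
    return predict

def cv_risk(screen, learner, X, y, folds=5):
    """Cross-validated mean squared error of one screen-learner pair."""
    n, sq_err = len(y), 0.0
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        keep = screen(Xtr, ytr)  # screening is redone inside every fold
        fit = learner([[row[j] for j in keep] for row in Xtr], ytr)
        sq_err += sum((y[i] - fit([X[i][j] for j in keep])) ** 2 for i in test)
    return sq_err / n

# Simulated data (an assumption): one nonlinear and one linear signal, 8 noise features.
rng = random.Random(0)
n, p = 100, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] ** 2 + row[1] + rng.gauss(0, 0.5) for row in X]

screens = {"all": screen_all, "pearson": screen_pearson, "spearman": screen_spearman}
learners = {"mean": learn_mean, "knn": learn_knn}
risks = {
    (s, l): cv_risk(sf, lf, X, y)
    for s, sf in screens.items()
    for l, lf in learners.items()
}
best = min(risks, key=risks.get)
print(best, round(risks[best], 3))
```

Because every screen is crossed with every learner and compared by cross-validated risk, a screen that discards a useful feature only hurts the candidates that use it, and the selector can fall back on candidates built from other screens. Running the screen inside each fold keeps the risk estimate honest, since screening on the full data would leak information from the held-out observations.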
