Abstract: When fitting statistical models, some predictors are often found to be correlated with each other and to function together. Many group variable selection methods have been developed to select the groups of predictors that are closely related to a continuous or categorical response. These existing methods usually assume that the group structure is known in advance, for example, variables with similar practical meaning or dummy variables created from a categorical variable. In practice, however, the exact group structure is rarely known, especially when the number of variables is large. As a result, the group variable selection results may be adversely affected. To address this challenge, we propose a two-stage approach that combines a variable clustering stage and a group variable selection stage. The variable clustering stage uses information from the data to find a group structure, which improves the performance of existing group variable selection methods. For ultrahigh dimensional data, where the number of predictors is much larger than the number of observations, we incorporate a variable screening method in the first stage and show the advantages of such an approach. In this article, we compare and discuss the performance of four existing group variable selection methods under different simulation models, with and without the variable clustering stage. The two-stage method performs better in terms of both prediction accuracy and accuracy in selecting active predictors. An athletes' data set is also used to show the advantages of the proposed method.
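The two-stage idea can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the four group selection methods, so everything here is an assumption made for illustration: stage one uses average-linkage hierarchical clustering on the distance 1 - |correlation|, stage two uses a group lasso fitted by proximal gradient descent, and the function names, the threshold, the penalty `lam`, and the simulated data are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_variables(X, threshold=0.5):
    """Stage 1 (assumed): cluster predictors on the distance 1 - |correlation|."""
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    dist = np.clip((dist + dist.T) / 2.0, 0.0, None)  # enforce symmetry, no negatives
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")


def group_lasso(X, y, groups, lam=0.5, n_iter=500):
    """Stage 2 (assumed): minimize (1/2n)||y - Xb||^2 + lam * sum_g sqrt(p_g) ||b_g||."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)  # gradient step
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            thresh = step * lam * np.sqrt(idx.sum())
            # Block soft-thresholding: shrink the whole group or set it to zero.
            z[idx] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * z[idx]
        beta = z
    return beta


# Simulated data: three blocks of four correlated predictors, only the first is active.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 3))
X = np.repeat(latent, 4, axis=1) + 0.3 * rng.normal(size=(n, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, :4] @ np.ones(4) + 0.5 * rng.normal(size=n)
y = y - y.mean()

groups = cluster_variables(X)            # data-driven group structure
beta = group_lasso(X, y, groups)         # group selection on the found groups
selected = np.flatnonzero(np.abs(beta) > 1e-8)
print("groups:", groups)
print("selected predictors:", selected)
```

In this sketch the group labels passed to the second stage come entirely from the data, which is the point of the clustering stage: no prior group structure is supplied. A screening step for the ultrahigh dimensional case is not shown.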

Which bridge estimator is optimal for variable selection?

Faithful Variable Screening for High-Dimensional Convex Regression

Regularization and variable selection for infinite variance autoregressive models

Variable Selection for Generalized Varying Coefficient Partially Linear Models with Diverging Number of Parameters

Large-P Variable Selection in Two-Stage Models

Variable Selection via Adaptive False Negative Control in Linear Regression

Consistent Tuning Parameter Selection in High Dimensional Sparse Linear Regression

Ultrahigh dimensional variable selection: beyond the linear model

Post Selection Shrinkage Estimation for High Dimensional Data Analysis

Adaptive Bi-Level Variable Selection for Quantile Regression Models with a Diverging Number of Covariates

Robust exponential squared loss-based variable selection for high-dimensional single-index varying-coefficient model

A Transparent and Nonlinear Method for Variable Selection

A High-dimensional M-estimator Framework for Bi-level Variable Selection

Variable Selection in High-Dimensional Quantile Varying Coefficient Models

Ensembling Variable Selectors by Stability Selection for the Cox Model

Variable Selection for Single-Index Models Based on Martingale Difference Divergence

Variable Selection for the Partial Linear Single-Index Model

Optimal Feature Selection in High-Dimensional Discriminant Analysis

A Two-Stage Variable Selection Approach for Correlated High Dimensional Predictors

Variable Selection for High-dimensional Cox Model with Error Rate Control

Variable Selection Via Thompson Sampling