Abstract: This article considers high-dimensional regression problems in which the number of predictors p exceeds the sample size n. We develop a model-averaging procedure for high-dimensional regression problems. Unlike most variable selection studies featuring the identification of true predictors, our focus here is on the prediction accuracy for the true conditional mean of y given the p predictors. Our method consists of two steps. The first step is to construct a class of regression models, each with a smaller number of regressors, to avoid the degeneracy of the information matrix. The second step is to find suitable model weights for averaging. To minimize the prediction error, we estimate the model weights using a delete-one cross-validation procedure. Departing from the literature of model averaging that requires the weights always sum to one, an important improvement we introduce is to remove this constraint. We derive some theoretical results to justify our procedure. A theorem is proved, showing that delete-one cross-validation achieves the lowest possible prediction loss asymptotically. This optimality result requires a condition that unravels an important feature of high-dimensional regression. The prediction error of any individual model in the class for averaging is required to be higher than the classic root n rate under the traditional parametric regression. This condition reflects the difficulty of high-dimensional regression and it depicts a situation especially meaningful for p > n. We also conduct a simulation study to illustrate the merits of the proposed approach over several existing methods, including lasso, group lasso, forward regression, Phase Coupled (PC)-simple algorithm, Akaike information criterion (AIC) model-averaging, Bayesian information criterion (BIC) model-averaging methods, and SCAD (smoothly clipped absolute deviation). This approach uses quadratic programming to overcome the computing time issue commonly encountered in the cross-validation literature. Supplementary materials for this article are available online.
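
The delete-one cross-validation step described in the abstract can be written out as a formula. The notation below is an assumption introduced for illustration and is not taken from the abstract: \(\tilde{\mu}_k^{(-i)}\) denotes the prediction of \(y_i\) from the \(k\)-th candidate model fitted without observation \(i\), \(M\) is the number of candidate models, and \(w = (w_1, \dots, w_M)\) are the weights. A criterion of the kind described could then take the form

\[
CV(w) = \sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{M} w_k \, \tilde{\mu}_k^{(-i)} \Big)^2, \qquad \hat{w} = \arg\min_{w \in [0,1]^M} CV(w).
\]

Because \(CV(w)\) is quadratic in \(w\), minimizing it over a box is a quadratic program, which is consistent with the abstract's remark about computing time. The feasible set \([0,1]^M\) is one possible choice that omits the constraint \(\sum_k w_k = 1\); it is assumed here and should be checked against the paper.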

What problem does this paper attempt to address?

This paper addresses prediction accuracy in high-dimensional regression, in particular when the number of predictors \(p\) exceeds the sample size \(n\). Specifically, the authors propose a model-averaging method for predicting the true conditional mean of \(y\) given the \(p\) predictors. Unlike most variable-selection studies, which aim to identify the true predictors, this work focuses on the prediction accuracy obtained through model averaging.

### Main contributions:

1. **Algorithm Feasibility**: The proposed algorithm is computationally feasible even in the presence of thousands of covariates.
2. **Relaxing Weight Constraints**: Departing from the existing model-averaging literature, the standard restriction that the model weights sum to 1 is removed, and this relaxation is shown to matter for prediction performance.
3. **Theoretical Analysis**: Minimizing the cross-validation criterion is proved to asymptotically minimize the squared error between the true mean and the predicted value, an "oracle" property similar to that of Li (1986, 1987) and Shao (1997) in the context of model selection.
4. **Unique Challenges in High-Dimensional Regression**: The optimality result requires that the prediction error of any individual model used for averaging be higher than the classical root-\(n\) rate of traditional parametric regression. This condition reflects the difficulty of high-dimensional regression, especially when \(p > n\).
5. **Setting the Number of Models**: A practical method is proposed for choosing the number of models to be averaged. Simulation studies show higher prediction accuracy than several existing methods, including lasso, group lasso, forward regression, the PC-simple algorithm, AIC model averaging, BIC model averaging, and SCAD.

### Method Steps:

1. **Prepare Candidate Models**: Construct a set of regression models, each containing a smaller number of regressors, to avoid the degeneracy of the information matrix.
2. **Optimize Model Weights**: Use delete-one cross-validation to choose the model weights that minimize the prediction error. Unlike traditional model-averaging methods, the weights are not required to sum to 1.

### Theoretical Results:

- **Asymptotic Optimality**: Under certain assumptions, delete-one cross-validation is proved to asymptotically achieve the lowest possible prediction loss.
- **Condition Analysis**: The meaning of these assumptions is discussed in detail in the context of high-dimensional regression, and a reasonable upper limit for the number of models \(M\) is proposed.

### Simulation Studies:

- The proposed method is compared with the existing methods listed above, and the results show more accurate predictions for the proposed method.

In conclusion, the paper proposes a model-averaging method for prediction in high-dimensional regression, supported by an asymptotic optimality result and by simulation evidence.
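
The two method steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate models are built by ranking predictors on absolute marginal correlation and splitting them into disjoint groups, the weights are restricted to the box \([0,1]^M\), and the quadratic program is solved by projected gradient descent rather than a dedicated solver. All function names (`build_candidate_models`, `loo_predictions`, `cv_weights`, `average_prediction`) are hypothetical.

```python
import numpy as np


def build_candidate_models(X, y, n_models, model_size):
    """Rank predictors by |marginal correlation| with y, then split the
    top-ranked columns into disjoint groups (an assumed construction)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    order = np.argsort(-corr)
    return [order[k * model_size:(k + 1) * model_size] for k in range(n_models)]


def loo_predictions(X, y, groups):
    """Return an n x M matrix whose (i, k) entry is model k's prediction
    of y_i when observation i is left out of the least-squares fit."""
    out = np.empty((X.shape[0], len(groups)))
    for k, idx in enumerate(groups):
        Xk = X[:, idx]
        H = Xk @ np.linalg.pinv(Xk)               # hat matrix of model k
        h = np.clip(np.diag(H), 0.0, 1.0 - 1e-8)  # leverages, kept below 1
        # Leave-one-out identity for least squares: no refitting needed.
        out[:, k] = (H @ y - h * y) / (1.0 - h)
    return out


def cv_weights(P, y, n_iter=5000):
    """Minimise ||y - P w||^2 over w in [0, 1]^M by projected gradient
    descent. No sum-to-one constraint is imposed."""
    M = P.shape[1]
    w = np.full(M, 1.0 / M)
    step = 1.0 / np.linalg.norm(P, 2) ** 2  # 1 / largest eigenvalue of P'P
    for _ in range(n_iter):
        grad = P.T @ (P @ w - y)
        w = np.clip(w - step * grad, 0.0, 1.0)  # project back onto the box
    return w


def average_prediction(X, y, X_new, groups, w):
    """Refit each candidate model on all data and return the weighted sum
    of the models' predictions at X_new."""
    pred = np.zeros(X_new.shape[0])
    for wk, idx in zip(w, groups):
        coef = np.linalg.pinv(X[:, idx]) @ y
        pred += wk * (X_new[:, idx] @ coef)
    return pred


# Synthetic example with p > n (data are simulated, for illustration only).
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 0.5
y = X @ beta + rng.standard_normal(n)

groups = build_candidate_models(X, y, n_models=8, model_size=5)
P = loo_predictions(X, y, groups)
w = cv_weights(P, y)
y_hat = average_prediction(X, y, X, groups, w)
print("weights:", np.round(w, 3), "sum:", round(float(w.sum()), 3))
```

Each candidate model uses far fewer regressors than observations, so its least-squares fit is well defined even though \(p > n\). The leave-one-out identity avoids refitting every model \(n\) times, and the printed sum of the weights is free to differ from 1.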

A Model-Averaging Approach for High-Dimensional Regression

On High-Dimensional Asymptotic Properties of Model Averaging Estimators

Sequential Model Averaging for High Dimensional Linear Regression Models

A Scalable Frequentist Model Averaging Method

Model Averaging Estimation for Nonparametric Varying-Coefficient Models with Multiplicative Heteroscedasticity

Ultra-High Dimensional Model Averaging for Multi-Categorical Response

Partial Linear Model Averaging Prediction for Longitudinal Data

Rank-Based Greedy Model Averaging for High-Dimensional Survival Data

Jackknife Model Averaging for Additive Expectile Prediction

Penalized Time-Varying Model Averaging

Robust Bayesian Model Averaging for Linear Regression Models With Heavy-Tailed Errors

Variable Screening and Model Averaging for Expectile Regressions

Model Averaging for Prediction with Fragmentary Data

Model averaging prediction by K-fold cross-validation

Model Averaging for Generalized Linear Model with Covariates that are Missing completely at Random

Parsimonious Model Averaging With a Diverging Number of Parameters

Model averaging for multivariate multiple regression models

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Model Averaging for Estimating Treatment Effects With Binary Responses

Two-step estimation of high dimensional additive models

Jackknife Model Averaging for Mixed-Data Kernel-Weighted Spline Quantile Regressions