Abstract: Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered.

What problem does this paper attempt to address?

This paper addresses regression analysis with a large number of categorical predictors and their interactions. When the number of potential predictors (especially the binary indicators for categories and their interactions) far exceeds the sample size, ordinary least squares cannot be applied. Shrinkage selection/estimation methods (such as Lasso, SCAD, or Elastic Net) remain applicable in this setting, but they still have limitations in high-dimensional situations. The paper therefore proposes an alternative based on mixed model prediction (MMP): by treating combinations of the categorical effects as random effects, it reduces the dimension of the problem and estimates the regression mean directly. The method is supported by theory and shows advantages over existing shrinkage methods in empirical comparisons.

### Main contributions of the paper:

1. **Proposing a new method**: Introduces the pseudo-MMP method, based on mixed model prediction, which treats combinations of the categorical variables and their interactions as random effects, thereby reducing the complexity of the high-dimensional problem.
2. **Theoretical verification**: Establishes the theoretical validity of the proposed method, proving convergence of the pseudo-maximum likelihood estimates (pseudo-MLEs) as well as consistency and L2-convergence of the pseudo-EBLUP.
3. **Empirical research**: Uses simulation experiments and a real-data example to show the advantage of the new method in prediction performance, especially when the sample size is small and the number of predictors is large.
4. **Uncertainty measurement**: Develops measures of uncertainty for the pseudo-EBLUP and evaluates their performance empirically.
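To make the dimensionality argument concrete, here is a minimal Python sketch, not taken from the paper. It counts the indicator columns a fixed-effects design would need and, as one reading of "treating combinations of the categorical effects as random effects", maps each observed combination of categories to a single cell label. The function names `count_indicator_columns` and `cell_labels`, and the choice of reference coding (a factor with C levels contributes C - 1 columns), are assumptions made for illustration.

```python
from itertools import combinations
from math import prod


def count_indicator_columns(levels, max_order=2):
    """Count dummy columns for main effects and interactions up to max_order.

    levels: list of category counts C_1, ..., C_q (hypothetical input).
    Assumes reference coding, so a factor with C levels gives C - 1 columns
    and an interaction gives the product of its factors' column counts.
    """
    total = 0
    for order in range(1, max_order + 1):
        for combo in combinations(levels, order):
            total += prod(c - 1 for c in combo)
    return total


def cell_labels(rows):
    """Map each observation's tuple of categories to an integer cell index.

    Cells are numbered in order of first appearance; only combinations that
    actually occur in the data receive a label.
    """
    index = {}
    labels = []
    for row in rows:
        key = tuple(row)
        if key not in index:
            index[key] = len(index)
        labels.append(index[key])
    return labels, index


if __name__ == "__main__":
    levels = [4, 5, 6]  # three categorical predictors (illustrative sizes)
    for order in (1, 2, 3):
        print(order, count_indicator_columns(levels, max_order=order))
    labels, index = cell_labels([(1, 2), (1, 2), (2, 1), (1, 1)])
    print(labels, len(index))
```

With all interaction orders included, the column count equals C_1 × ... × C_q - 1, so it grows multiplicatively in the number of factors, while the number of observed cells can never exceed the sample size.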
### Overview of the paper structure:

- **Introduction**: Introduces the research background and motivation, and explains the importance of high-dimensional regression problems in modern data analysis.
- **Method description**: Describes in detail the construction of the pseudo-MMP method, including how categorical variables and their interactions are converted into random effects.
- **Simulation experiment**: Compares the new method with existing shrinkage methods on simulated data and shows its advantages in different scenarios.
- **Asymptotic theory**: Provides proofs of the convergence of the pseudo-MLEs and of the consistency and L2-convergence of the pseudo-EBLUP.
- **Uncertainty measurement**: Proposes measures of uncertainty for the pseudo-EBLUP and evaluates their performance empirically.
- **Real-data application**: Applies the method to a bone marrow transplantation data set to further examine its effectiveness.
- **Discussion and conclusion**: Summarizes the main findings and discusses future research directions.

### Key formulas:

- **Regression model**:
  \[ y_i = b_0 + x_i' b + \sum_{j=1}^{q} \sum_{k=1}^{C_j} a_{jk} 1(c_{ij} = k) + \epsilon_i, \quad i = 1, \ldots, N \]
- **Regression mean**:
  \[ \theta_i = b_0 + x_i' b + \sum_{j=1}^{q} \sum_{k=1}^{C_j} a_{jk} 1(c_{ij} = k), \quad i = 1, \ldots, N \]
- **Pseudo-EBLUP**:
  \[ \hat{\theta}_i = \hat{b}_0 + x_i' \hat{b} + z_i' \hat{\alpha}, \quad z_i' \hat{\alpha} = \frac{\hat{h} n_k}{1 + \hat{h} n_k} \left( \bar{y}_{k\cdot} - \hat{b}_0 - \bar{x}_{k\cdot}' \hat{b} \right) \]

Through these contributions, the paper provides a new and effective method for high-dimensional regression problems, especially those involving a large number of categorical variables and their interactions.
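The pseudo-EBLUP formula can be evaluated directly once estimates are available. The sketch below is illustrative only and is not the paper's implementation. It assumes that k indexes the cell (combination of categories) containing observation i, that n_k is the number of observations in that cell, that the bars denote within-cell means, and that the estimated h plays the role of a ratio of random-effect variance to error variance. It takes the estimates of b_0, b, and h as given inputs and does not compute the pseudo-MLEs. The function name `pseudo_eblup` and its arguments are hypothetical.

```python
from collections import defaultdict


def pseudo_eblup(y, x, cell, b0_hat, b_hat, h_hat):
    """Evaluate theta_hat_i = b0 + x_i'b + w_k * (ybar_k - b0 - xbar_k'b).

    y: list of responses; x: list of covariate vectors (lists of floats);
    cell: list of cell labels k, one per observation (assumed meaning of k);
    b0_hat, b_hat, h_hat: estimates supplied by the caller (not fitted here).
    The weight is w_k = h * n_k / (1 + h * n_k).
    """
    def linear(xi):
        # Fixed-effects part b0 + x_i'b.
        return b0_hat + sum(xj * bj for xj, bj in zip(xi, b_hat))

    # Per cell, accumulate the count n_k and the sum of y_i - b0 - x_i'b.
    n = defaultdict(int)
    resid_sum = defaultdict(float)
    for yi, xi, k in zip(y, x, cell):
        n[k] += 1
        resid_sum[k] += yi - linear(xi)

    theta = []
    for xi, k in zip(x, cell):
        # The cell mean of the residuals equals ybar_k - b0 - xbar_k'b.
        mean_resid = resid_sum[k] / n[k]
        weight = h_hat * n[k] / (1.0 + h_hat * n[k])
        theta.append(linear(xi) + weight * mean_resid)
    return theta


if __name__ == "__main__":
    y = [1.0, 3.0, 5.0]
    x = [[0.0], [0.0], [1.0]]
    cell = [0, 0, 1]
    print(pseudo_eblup(y, x, cell, b0_hat=0.0, b_hat=[1.0], h_hat=1.0))
```

The weight lies between 0 and 1: setting the h estimate to 0 returns the fixed-effects fit alone, while a large value or a large cell size n_k moves the prediction toward the cell's own mean residual.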

A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions

Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects

Penalized Independence Rule for Testing High-Dimensional Hypotheses

A binarization approach to model interactions between categorical predictors in Generalized Linear Models

Scalable Estimation of Multinomial Response Models with Random Consideration Sets

Regression-based multiple treatment effect estimation under covariate-adaptive randomization

Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods

Modelling correlated ordinal data by random-effects logistic regression models: simulation and application

Leveraging independence in high-dimensional mixed linear regression

Macrophage Migration Inhibitory Factor and Interleukin-8 Produced by Gastric Epithelial Cells during Helicobacter pylori Exposure Induce Expression and Activation of the Epidermal Growth Factor Receptor

Spline Regression in the Presence of Categorical Predictors

Random-effect Based Test for Multinomial Logistic Regression: Choice of the Reference Level and Its Impact on the Testing

Shrinkage for Categorical Regressors

Bayesian Regression Analysis of Data with Random Effects Covariates from Nonlinear Longitudinal Measurements

A reduced-rank approach to predicting multiple binary responses through machine learning

Ultra-High Dimensional Model Averaging for Multi-Categorical Response

A general theory of regression adjustment for covariate-adaptive randomization: OLS, Lasso, and beyond

Penalized Regression Adjusted Causal Effect Estimates in High Dimensional Randomized Experiments

Selection of Regression Models under Linear Restrictions for Fixed and Random Designs

A General Framework for Random Effects Models for Binary, Ordinal, Count Type and Continuous Dependent Variables Including Variable Selection

Detection of latent heteroscedasticity and group-based regression effects in linear models via Bayesian model selection