A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions

Hanmei Sun, Jiangshan Zhang, Jiming Jiang
2024-09-14
Abstract: Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered.
Methodology, Statistics Theory
What problem does this paper attempt to address?
This paper addresses regression with a large number of categorical predictors and their interactions. When the number of potential predictors (especially the binary indicators for categories and their interactions) far exceeds the sample size, ordinary least squares cannot be applied. Existing shrinkage selection/estimation methods (such as Lasso, SCAD, or Elastic Net) remain applicable in this setting, but they still have limitations in high-dimensional situations. The paper therefore proposes a new method based on mixed model prediction (MMP). By treating combinations of the categorical effects as random effects, it reduces the dimensionality of the problem and estimates the regression mean directly. The method is supported by theory and, in the paper's empirical comparisons, shows advantages over existing shrinkage methods.

### Main contributions of the paper:

1. **Proposing a new method**: Introduces the pseudo-MMP method, based on mixed model prediction, which treats combinations of the categorical variables and their interactions as random effects and thereby reduces the complexity of the high-dimensional problem.
2. **Theoretical verification**: Establishes the theoretical validity of the proposed method by proving convergence of the pseudo-maximum likelihood estimators (pseudo-MLEs), and consistency and L2-convergence of the pseudo-EBLUP.
3. **Empirical research**: Shows, through simulation experiments and a real-data example, the advantage of the new method in prediction performance, especially when the sample size is small and the number of predictors is large.
4. **Uncertainty measurement**: Develops measures of uncertainty for the pseudo-EBLUP and evaluates their performance empirically.
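To make the dimension reduction concrete, the following minimal sketch contrasts the number of indicator columns needed for all main effects and interactions with the number of distinct level combinations actually observed. It is an illustration only: the function names, the toy data, and the choice to count interactions of every order without dropping reference levels are assumptions, and reading "combinations of the categorical effects" as observed cells is one plausible interpretation rather than the paper's stated construction.

```python
from math import prod


def count_indicator_columns(levels_per_factor):
    """Count indicator columns for all main effects and interactions of every order.

    Assumes factor j has C_j levels and no reference level is dropped.
    Summing the products of C_j over all non-empty subsets of factors
    equals prod(C_j + 1) - 1.
    """
    return prod(c + 1 for c in levels_per_factor) - 1


def assign_cells(cat_rows):
    """Map each observation's tuple of categorical levels to a cell index.

    Returns the per-observation cell indices and the number of distinct cells.
    """
    cell_index = {}
    cells = []
    for row in cat_rows:
        key = tuple(row)
        if key not in cell_index:
            cell_index[key] = len(cell_index)
        cells.append(cell_index[key])
    return cells, len(cell_index)


# Hypothetical toy data: three categorical predictors with 3, 4, and 2 levels.
levels_per_factor = [3, 4, 2]
cat_rows = [(1, 2, 1), (1, 2, 1), (3, 1, 2), (2, 4, 1), (3, 1, 2)]

n_columns = count_indicator_columns(levels_per_factor)
cells, n_cells = assign_cells(cat_rows)
print(n_columns, n_cells, cells)
```

In this toy case the indicator design needs far more columns than there are observations, while the number of observed cells can never exceed the sample size.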
### Overview of the paper structure:

- **Introduction**: Introduces the research background and motivation, and explains the importance of high-dimensional regression problems in modern data analysis.
- **Method description**: Describes in detail the construction of the pseudo-MMP method, including how the categorical variables and their interactions are turned into random effects.
- **Simulation experiment**: Compares the new method with existing shrinkage methods on simulated data and shows its advantages in different scenarios.
- **Asymptotic theory**: Proves convergence of the pseudo-MLEs, and consistency and L2-convergence of the pseudo-EBLUP.
- **Uncertainty measurement**: Proposes measures of uncertainty for the pseudo-EBLUP and evaluates their performance empirically.
- **Real-data application**: Applies the method to a bone marrow transplantation data set to further assess its performance.
- **Discussion and conclusion**: Summarizes the main findings and discusses future research directions.

### Key formulas:

- **Regression model**:

  \[ y_i = b_0 + x_i' b + \sum_{j=1}^{q} \sum_{k=1}^{C_j} a_{jk} 1(c_{ij} = k) + \epsilon_i, \quad i = 1, \ldots, N \]

- **Regression mean**:

  \[ \theta_i = b_0 + x_i' b + \sum_{j=1}^{q} \sum_{k=1}^{C_j} a_{jk} 1(c_{ij} = k), \quad i = 1, \ldots, N \]

- **Pseudo-EBLUP**:

  \[ \hat{\theta}_i = \hat{b}_0 + x_i' \hat{b} + z_i' \hat{\alpha}, \quad z_i' \hat{\alpha} = \frac{\hat{h} n_k}{1 + \hat{h} n_k} \left( \bar{y}_{k\cdot} - \hat{b}_0 - \bar{x}_{k\cdot}' \hat{b} \right) \]

Through these contributions, the paper provides a new and effective method for high-dimensional regression problems, especially those involving a large number of categorical variables and their interactions.
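The pseudo-EBLUP formula under "Key formulas" can be evaluated with a few lines of code once the estimates are available. The sketch below assumes that \(k\) is the group (combination of categorical levels) containing observation \(i\), that \(n_k\), \(\bar{y}_{k\cdot}\), and \(\bar{x}_{k\cdot}\) are that group's size and sample means, and that \(\hat{b}_0\), \(\hat{b}\), and \(\hat{h}\) are supplied by the caller. This summary does not define \(\hat{h}\) or describe how the pseudo-MLEs are computed, so the function name `pseudo_eblup` and its arguments are hypothetical and no estimation step is shown.

```python
from collections import defaultdict


def pseudo_eblup(y, X, cells, b0, b, h):
    """Evaluate theta_hat_i = b0 + x_i'b + shrink_k * (ybar_k - b0 - xbar_k'b).

    y     : list of responses
    X     : list of covariate rows (lists of floats)
    cells : group label of each observation (assumed meaning of index k)
    b0, b : intercept and slope estimates, taken as given
    h     : the quantity written h-hat in the formula, taken as given
    """
    # Fixed-effect part b0 + x_i'b for every observation.
    fixed = [b0 + sum(xj * bj for xj, bj in zip(x, b)) for x in X]

    # Per-group size n_k and sum of residuals y_i - b0 - x_i'b.
    n = defaultdict(int)
    resid_sum = defaultdict(float)
    for yi, fi, k in zip(y, fixed, cells):
        n[k] += 1
        resid_sum[k] += yi - fi

    theta_hat = []
    for fi, k in zip(fixed, cells):
        # The group mean residual equals ybar_k - b0 - xbar_k'b.
        mean_resid = resid_sum[k] / n[k]
        shrink = h * n[k] / (1.0 + h * n[k])
        theta_hat.append(fi + shrink * mean_resid)
    return theta_hat


# Hypothetical toy inputs, chosen only to exercise the formula.
y = [1.0, 3.0, 10.0]
X = [[0.0], [0.0], [0.0]]
cells = [0, 0, 1]
theta_hat = pseudo_eblup(y, X, cells, b0=0.0, b=[0.0], h=1.0)
print(theta_hat)
```

The factor \(\hat{h} n_k / (1 + \hat{h} n_k)\) lies between 0 and 1 for positive \(\hat{h}\): when \(\hat{h} = 0\) the prediction reduces to the fixed-effect part, and larger groups are shrunk less toward it.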