The use of the EM algorithm for regularization problems in high-dimensional linear mixed-effects models

Daniela C. R. Oliveira, Fernanda L. Schumacher, Victor H. Lachos
DOI: https://doi.org/10.48550/arXiv.2308.01518
2023-08-03
Abstract: The EM algorithm is a popular tool for maximum likelihood estimation but has not been used much for high-dimensional regularization problems in linear mixed-effects models. In this paper, we introduce the EMLMLasso algorithm, which combines the EM algorithm and the popular and efficient R package glmnet for Lasso variable selection of fixed effects in linear mixed-effects models. We compare the performance of our proposed EMLMLasso algorithm with the one implemented in the well-known R package glmmLasso through the analyses of both simulated and real-world applications. The simulations and applications demonstrated good properties, such as consistency, and the effectiveness of the proposed variable selection procedure, for both $p < n$ and $p > n$. Moreover, in all evaluated scenarios, the EMLMLasso algorithm outperformed glmmLasso. The proposed method is quite general and can be easily extended for ridge and elastic net penalties in linear mixed-effects models.
Methodology
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses regularization in high-dimensional linear mixed-effects models (LMM). Specifically, it focuses on how to select fixed-effect variables effectively when the number of predictors \( p \) exceeds the number of observations \( n \). This situation is common in practice, especially in genetics, health, finance, ecology, and image processing. Although several statistical methods have been proposed for variable selection, selecting fixed effects in high-dimensional data remains a challenge.

### Background and Motivation

1. **Linear Mixed-Effects Models (LMM)**:
   - LMMs are a class of statistical models that describe the relationship between a response variable and covariates, and they are particularly suited to clustered or longitudinal data.
   - As data volumes have grown, LMMs have become increasingly important in many fields.
2. **High-Dimensional Data**:
   - When the number of predictors \( p \) is much greater than the number of observations \( n \), variable selection becomes a high-dimensional problem.
   - Despite continuing advances in computational, statistical, and technological tools, selecting fixed effects in high-dimensional data remains difficult.
3. **Existing Methods**:
   - Existing methods include penalized maximum likelihood (PML) estimation with an L1 penalty, which has been applied in some studies.
   - However, the performance of these methods on high-dimensional data is not always satisfactory.

### Proposed Method

The authors propose the EMLMLasso algorithm, which combines the EM algorithm with the Lasso variable selection provided by the R package `glmnet` to select fixed effects in high-dimensional linear mixed-effects models. The steps are as follows:

1. **Initialization**:
   - Set initial parameter values: the fixed-effect coefficients \( \beta \), the error variance \( \sigma^2 \), and the random-effects covariance matrix \( D \).
2. **E Step**:
   - Compute the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter estimates.
3. **M Step**:
   - Update the parameters by maximizing this conditional expectation, with the Lasso penalty applied to \( \beta \).
4. **Tuning Parameter Selection**:
   - Use the Bayesian Information Criterion (BIC) to select the tuning parameter \( \lambda \).

### Experimental Results

1. **Simulation Experiments**:
   - The authors assessed the EMLMLasso algorithm in simulation experiments and compared it with the algorithm implemented in the R package glmmLasso.
   - In all evaluated scenarios, EMLMLasso outperformed glmmLasso in both variable selection and parameter estimation.
2. **Real Data Applications**:
   - The authors applied the EMLMLasso algorithm to two real datasets: the Framingham cholesterol data and the riboflavin production gene data.
   - The results indicate that the EMLMLasso algorithm performs well on both datasets.

### Conclusion

This paper proposes the EMLMLasso algorithm, which combines the EM algorithm with a Lasso penalty to select fixed effects in high-dimensional linear mixed-effects models. Simulation experiments and real data applications show that it performs better than glmmLasso in the settings studied, for both \( p < n \) and \( p > n \).
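
To make the E and M steps listed under Proposed Method more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the simplest special case, a random-intercept model \( y_{ij} = x_{ij}^\top \beta + b_i + e_{ij} \) with \( b_i \sim N(0, d) \) and \( e_{ij} \sim N(0, \sigma^2) \), so that \( D \) reduces to the scalar \( d \). It also replaces `glmnet` with a small hand-written coordinate-descent Lasso that uses the same \( \frac{1}{2N}\mathrm{RSS} + \lambda \lVert\beta\rVert_1 \) scaling. The function names (`soft_threshold`, `lasso_cd`, `em_lasso`, `bic`, `simulate`), the simulated data, and the choice of BIC degrees of freedom (the number of nonzero coefficients) are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(z, lam):
    """Soft-thresholding: the closed-form Lasso update for one coordinate."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, beta, sweeps=20):
    """Coordinate descent for (1/(2N)) * RSS + lam * ||beta||_1."""
    n, p = len(y), len(beta)
    beta = list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(sweeps):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            beta[j] = new
    return beta


def em_lasso(X, y, groups, lam, iters=30):
    """EM with a Lasso M step for the random-intercept model (illustrative only)."""
    n, p = len(y), len(X[0])
    ids = sorted(set(groups))
    beta, sigma2, d = [0.0] * p, 1.0, 1.0  # initialization
    for _ in range(iters):
        fitted = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # E step: posterior mean m[g] and variance v[g] of each random intercept.
        m, v = {}, {}
        for g in ids:
            idx = [i for i in range(n) if groups[i] == g]
            v[g] = 1.0 / (len(idx) / sigma2 + 1.0 / d)
            m[g] = v[g] * sum(y[i] - fitted[i] for i in idx) / sigma2
        # M step (beta): Lasso on the response minus the predicted random effects.
        work = [y[i] - m[groups[i]] for i in range(n)]
        beta = lasso_cd(X, work, lam, beta)
        # M step (variances): closed-form updates from expected squared terms.
        fitted = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        rss = sum((work[i] - fitted[i]) ** 2 for i in range(n))
        sigma2 = (rss + sum(v[groups[i]] for i in range(n))) / n
        d = sum(m[g] ** 2 + v[g] for g in ids) / len(ids)
    return beta, sigma2, d


def bic(X, y, groups, beta, sigma2, d):
    """BIC from the marginal log-likelihood; df = nonzero coefficients (assumed)."""
    n, p = len(y), len(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    loglik = 0.0
    for g in set(groups):
        r = [resid[i] for i in range(n) if groups[i] == g]
        k, s = len(r), sum(r)
        logdet = (k - 1) * math.log(sigma2) + math.log(sigma2 + k * d)
        quad = (sum(e * e for e in r) - d * s * s / (sigma2 + k * d)) / sigma2
        loglik += -0.5 * (k * math.log(2 * math.pi) + logdet + quad)
    df = sum(1 for b in beta if b != 0.0)
    return -2.0 * loglik + math.log(n) * df


def simulate(seed=0, n_groups=20, per_group=5, p=5):
    """Toy clustered data: only the first predictor has a nonzero effect."""
    rng = random.Random(seed)
    X, y, groups = [], [], []
    for g in range(n_groups):
        b = rng.gauss(0.0, 1.0)
        for _ in range(per_group):
            x = [rng.gauss(0.0, 1.0) for _ in range(p)]
            X.append(x)
            y.append(3.0 * x[0] + b + rng.gauss(0.0, 0.5))
            groups.append(g)
    return X, y, groups


X, y, groups = simulate()
fits = {lam: em_lasso(X, y, groups, lam) for lam in (0.05, 0.2, 0.5)}
best = min(fits, key=lambda lam: bic(X, y, groups, *fits[lam]))
print(best, [round(b, 2) for b in fits[best][0]])
```

The script fits the model on a small grid of \( \lambda \) values and keeps the fit with the lowest BIC, mirroring the tuning step in the summary. A general implementation would handle an arbitrary random-effects design and a full covariance matrix \( D \), and would delegate the penalized update of \( \beta \) to `glmnet` as the paper describes.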