Abstract: Mixed-effect models are very popular for analyzing data with a hierarchical structure, e.g. repeated observations within subjects in a longitudinal design, or patients nested within centers in a multicenter design. Recently, however, due to medical advances, the number of fixed-effect covariates collected from each patient can be quite large, e.g. data on the gene expressions of each patient, and not all of these variables are necessarily important for the outcome. It is therefore very important to choose the relevant covariates correctly in order to obtain the optimal inference for the overall study. On the other hand, the relevant random effects will often be low-dimensional and pre-specified. In this paper, we consider regularized selection of important fixed-effect variables in linear mixed-effects models along with maximum penalized likelihood estimation of both fixed- and random-effect parameters based on general non-concave penalties. Asymptotic and variable selection consistency with oracle properties are proved for low-dimensional cases as well as for high dimensionality of non-polynomial order of the sample size (the number of parameters is much larger than the sample size). We also provide a suitable, computationally efficient algorithm for implementation. Additionally, all the theoretical results are proved for a general non-convex optimization problem that applies to several important situations well beyond the mixed-model set-up (such as finite mixtures of regressions), illustrating the wide range of applicability of our proposal.
What problem does this paper attempt to address?
This paper addresses the problem of selecting fixed-effect variables in linear mixed-effect models fitted to high-dimensional datasets. With advances in medical research, a large number of fixed-effect covariates (such as gene expression data) can be collected for each patient, but not all of these variables matter for the outcome. Selecting the relevant covariates correctly is therefore crucial for obtaining the optimal inference for the overall study. Most existing methods, however, are limited to the classical low-dimensional setting (the sample size is larger than the number of parameters) and perform poorly on modern high-dimensional datasets (the number of parameters is much larger than the sample size).
To solve this problem, the paper proposes a regularized selection method that uses general non-concave penalty functions and estimates the fixed-effect and random-effect parameters simultaneously. The method applies to both the classical low-dimensional case and the high-dimensional case. For both cases, the paper proves the consistency of the maximum penalized likelihood estimators (MPLEs) and the oracle property of variable selection, and it provides a computationally efficient algorithm for computing the estimators.
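For orientation, the estimation problem described above can be written out as follows. The notation is an assumption based on the standard linear mixed-effects formulation, not taken from the paper: \(m\) groups, group-level design matrices \(X_i\) and \(Z_i\), random-effect covariance \(\Psi\), error variance \(\sigma^2\), and a penalty \(p_\lambda\) applied to the \(P\) fixed-effect coefficients only.

\[
y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad b_i \sim N(0, \Psi), \quad \varepsilon_i \sim N(0, \sigma^2 I_{n_i}), \qquad i = 1, \dots, m,
\]

\[
Q_n(\beta, \eta) = -\ell_n(\beta, \eta) + n \sum_{j=1}^{P} p_\lambda(|\beta_j|).
\]

Here \(\ell_n\) denotes the marginal Gaussian log-likelihood obtained from \(y_i \sim N(X_i \beta,\; Z_i \Psi Z_i^\top + \sigma^2 I_{n_i})\), \(\eta\) collects the variance parameters \((\Psi, \sigma^2)\), and \(n = \sum_i n_i\) is the total number of observations. Scaling the penalty by \(n\) is one common convention and may differ from the paper's. The MPLE is a minimizer of \(Q_n\), and only \(\beta\) is penalized, in line with the abstract's remark that the random effects are low-dimensional and pre-specified.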
### Main contributions
1. **Asymptotic theory for general non-convex loss functions and non-concave penalty functions**:
- The paper develops the asymptotic theory of maximum penalized likelihood estimation under general non-convex loss functions and non-concave penalty functions, covering both the classical low-dimensional case (\(P < n\)) and the high-dimensional case (\(P \gg n\)). This general theory also applies to a variety of non-standard statistical models, such as finite mixtures of regressions, which extends the scope of the existing literature.
2. **Asymptotic distribution for high-dimensional linear mixed-effect models**:
- In the high-dimensional case, the paper derives the asymptotic distribution of the maximum penalized likelihood estimator under general non-concave penalty functions, a result not covered in the existing literature.
3. **Application advantages**:
- Through simulations and real-data examples, the paper shows that the SCAD penalty outperforms the traditional L1 penalty (LASSO) in selecting and estimating fixed-effect variables in linear mixed-effect models, which gives practical guidance on the choice of penalty.
### Method overview
- **Non-concave penalty functions**:
- Non-concave penalty functions (such as SCAD) are used to select the important fixed-effect variables. Such penalties yield estimators that are nearly unbiased for large coefficients, sparse, and continuous in the data, which makes them well suited to variable selection in high-dimensional datasets.
- **Maximum penalized likelihood estimation**:
- The fixed-effect and random-effect parameters are estimated simultaneously by minimizing the negative log-likelihood plus the non-concave penalty terms. Because this is a non-convex optimization problem, the paper proposes suitable quadratic approximations and an iterative algorithm to solve it.
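The two ingredients above, a SCAD penalty and an iterative quadratic approximation, can be sketched in a much simpler setting. The code below is a minimal illustration and not the paper's algorithm: it drops the random effects entirely and applies a local quadratic approximation (LQA) to penalized least squares, \(\frac{1}{2n}\lVert y - X\beta\rVert^2 + \sum_j p_\lambda(|\beta_j|)\). The function names, the default `a=3.7`, the threshold `eps`, and the least-squares starting value are all assumptions made for this sketch.

```python
import numpy as np


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lam(|t|): linear near zero, quadratic in between, flat beyond a*lam."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    constant = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))


def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |t|; zero once |t| >= a*lam."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))


def lqa_scad_least_squares(X, y, lam, a=3.7, n_iter=100, tol=1e-8, eps=1e-6):
    """Hypothetical helper: SCAD-penalized least squares via local quadratic approximation.

    Each iteration replaces the penalty by a quadratic around the current
    estimate, which turns the update into a ridge-type linear solve.
    Coefficients whose magnitude falls below `eps` are set to zero and dropped.
    """
    n, p = X.shape
    # Assumed starting value: ordinary least squares (minimum-norm solution if p > n).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    active = np.ones(p, dtype=bool)
    for _ in range(n_iter):
        active &= (np.abs(beta) > eps)
        beta[~active] = 0.0
        if not active.any():
            break
        Xa = X[:, active]
        # LQA weights p'_lam(|beta_j|) / |beta_j| for the active coefficients.
        w = scad_derivative(beta[active], lam, a) / np.abs(beta[active])
        new = np.linalg.solve(Xa.T @ Xa + n * np.diag(w), Xa.T @ y)
        change = np.max(np.abs(new - beta[active]))
        beta[active] = new
        if change < tol:
            break
    beta[np.abs(beta) <= eps] = 0.0
    return beta
```

Calling `lqa_scad_least_squares(X, y, lam)` returns a coefficient vector in which small effects are set exactly to zero, while coefficients larger than `a * lam` receive no shrinkage because the SCAD derivative vanishes there. In a mixed model the same kind of update would additionally involve the marginal covariance of the responses and a step for the variance parameters, both of which are omitted here.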
### Theoretical results
- **Consistency**:
- The paper proves the consistency of the maximum penalized likelihood estimator and the oracle property of variable selection in both the low- and high-dimensional cases.
- **Asymptotic distribution**:
- The paper derives the asymptotic distribution of the estimators of the fixed-effect and random-effect parameters, which is the basis for estimating standard errors and for inference on the variance parameters.
### Application examples
- **Simulation and real data**:
- Simulation experiments and real-data examples are used to verify the effectiveness of the proposed method. In high-dimensional datasets in particular, the SCAD penalty performs better than the L1 penalty.
In summary, this paper addresses fixed-effect variable selection in high-dimensional linear mixed-effect models by combining non-concave penalty functions with maximum penalized likelihood estimation, and it supports the method with both theoretical results and practical guidance.