Abstract: Statistical inference on the explained variation of an outcome by a set of covariates is of particular interest in practice. When the covariates are of moderate to high dimension and the effects are not sparse, several approaches have been proposed for estimation and inference. One major problem with the existing approaches is that the inference procedures are not robust to violations of the normality assumption on the covariates and the residual errors. In this paper, we propose an estimating equation approach to the estimation and inference on the explained variation in the high-dimensional linear model. Unlike the existing approaches, the proposed approach does not rely on the restrictive normality assumptions for inference. It is shown that the proposed estimator is consistent and asymptotically normally distributed under reasonable conditions. Simulation studies demonstrate better performance of the proposed inference procedure in comparison with the existing approaches. The proposed approach is applied to studying the variation of glycohemoglobin explained by environmental pollutants in a National Health and Nutrition Examination Survey data set.
What problem does this paper attempt to address?
This paper addresses estimation and inference for the amount of variation in an outcome explained by a set of covariates in high-dimensional linear models, especially when the covariate effects are not sparse. Existing methods usually assume that the covariates and the residual errors are normally distributed, an assumption that may not hold in practice, for example with environmental pollutant data. As a result, inference from these methods can be unreliable when the effects are dense and the covariates or errors are non-normal.
### Main contributions of the paper:
1. **Propose an estimating equation approach**: The approach estimates and makes inference on the explained variation in high-dimensional linear models without relying on restrictive normality assumptions.
2. **Prove the consistency and asymptotic normality of the new estimator**: Under reasonable conditions, the proposed estimator is consistent and asymptotically normally distributed.
3. **Compare the new method with existing ones in simulation studies**: The proposed inference procedure performs better than existing approaches, particularly with non-normal data.
4. **Apply the method to real-world data**: The method is applied to a National Health and Nutrition Examination Survey (NHANES) data set to study the variation of glycohemoglobin (glycated hemoglobin) explained by environmental pollutants.
### Specific problem description:
- **Background**: Estimating and making inference on the variation explained by a set of covariates matters in many fields, for example heritability in genetics research and the signal-to-noise ratio in wireless communication.
- **Problem**: When the covariate dimension is high and the effects are not sparse, the inference procedures of existing methods often perform poorly if the covariates and residual errors are not normally distributed.
- **Solution**: The paper proposes a method based on estimating equations that does not rely on normality assumptions and gives valid estimation and inference for the explained variation in high-dimensional data.
### Key formulas:
- **Explained variation \( r^2 \)**:
\[
r^2=\frac{\beta^T\Sigma\beta}{\beta^T\Sigma\beta+\sigma^2_{\epsilon}}
\]
where \(\beta\) is the vector of regression coefficients, \(\Sigma\) is the covariance matrix of covariates, and \(\sigma^2_{\epsilon}\) is the variance of the residual error.
- **Estimator derived from the estimating equation**:
\[
\hat{r}^2 = \frac{\text{tr}\left[W \left(\frac{1}{\hat{\sigma}^2_Y} \tilde{Y} \tilde{Y}^T - (I - \frac{1_n 1_n^T}{n})\right)\right]}{\text{tr}\left[W \left(M - (I - \frac{1_n 1_n^T}{n})\right)\right]}
\]
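where \(\tilde{Y} = \left(I - \frac{1_n 1_n^T}{n}\right) Y\) is the centered outcome vector, \(M=\frac{1}{p}\left(I - \frac{1_n 1_n^T}{n}\right) Z Z^T \left(I - \frac{1_n 1_n^T}{n}\right)\) is built from the column-centered \(n \times p\) covariate matrix \(Z\), \(I\) is the \(n \times n\) identity matrix, \(1_n\) is the \(n\)-vector of ones, \(\hat{\sigma}^2_Y\) is the estimated variance of \(Y\), and \(W\) is a weight matrix.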
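The two formulas above can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the function names `explained_variation` and `r2_estimate`, the default weight \(W = M\), the scaling of each covariate column to unit sample variance, the use of the sample variance for \(\hat{\sigma}^2_Y\), and all simulation settings (skewed covariates, heavy-tailed errors, \(n = 1000\), \(p = 500\)) are assumptions made for the example.

```python
import numpy as np

def explained_variation(beta, Sigma, sigma2_eps):
    """Population r^2 = b'Sb / (b'Sb + sigma_eps^2)."""
    signal = float(beta @ Sigma @ beta)
    return signal / (signal + sigma2_eps)

def r2_estimate(Y, X, W=None):
    """Trace-ratio estimate of r^2 (hypothetical helper, assumed defaults)."""
    n, p = X.shape
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - 1_n 1_n^T / n
    Z = X / X.std(axis=0, ddof=1)            # assumed: unit sample variance per column
    Zc = C @ Z                               # column-centered covariates
    M = Zc @ Zc.T / p
    Yc = C @ Y                               # centered outcome
    sigma2_Y = Yc @ Yc / (n - 1)             # sample variance of Y
    if W is None:
        # Assumed weight. With this standardization W = I would give 0/0,
        # because both traces are then identically zero.
        W = M
    num = np.trace(W @ (np.outer(Yc, Yc) / sigma2_Y - C))
    den = np.trace(W @ (M - C))
    return num / den

# Simulated data with non-normal covariates and errors (illustrative settings)
rng = np.random.default_rng(0)
n, p = 1000, 500
X = rng.exponential(1.0, size=(n, p))              # skewed covariates, variance 1
beta = np.full(p, 1.0 / np.sqrt(p))                # dense effects, beta'beta = 1
eps = rng.standard_t(5, size=n) / np.sqrt(5 / 3)   # heavy-tailed errors, variance 1
Y = X @ beta + eps

true_r2 = explained_variation(beta, np.eye(p), 1.0)
est = r2_estimate(Y, X)
print(f"true r^2 = {true_r2:.3f}, estimate = {est:.3f}")
```

Passing a different matrix as `W` changes the weighting of the moment conditions; how the paper chooses \(W\) is not stated in this summary, so the default here should be read only as a placeholder.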
### Simulation study results:
- **Coverage rate and confidence interval length**: Under different sample sizes and covariate distributions, the confidence interval coverage rate of the new method is close to the nominal level, and the length of the confidence interval is reasonable.
- **Comparison with other methods**: The new method performs better than existing methods in dealing with non-normal data, especially when the sample size is larger than the covariate dimension.
### Practical applications:
- **NHANES data set**: The method is applied to study how much of the variation in glycohemoglobin is explained by environmental pollutants (such as polychlorinated biphenyls), illustrating its use for estimating explained variation in real data.
In conclusion, the paper proposes an estimating equation approach to estimation and inference for the explained variation in high-dimensional linear models, which remains valid when the covariate effects are dense and the covariates or residual errors are non-normal.