Benign Overfitting of Non-Sparse High-Dimensional Linear Regression with Correlated Noise

Toshiki Tsuda, Masaaki Imaizumi
2023-10-20
Abstract: We investigate the high-dimensional linear regression problem in the presence of noise correlated with Gaussian covariates. This correlation, known as endogeneity in regression models, often arises from unobserved variables and other factors. It has been a major challenge in causal inference and econometrics. When the covariates are high-dimensional, it is common to assume sparsity of the true parameters and estimate them using regularization, even under endogeneity. However, when sparsity does not hold, it is not well understood how to control endogeneity and high dimensionality simultaneously. This study demonstrates that an estimator without regularization can achieve consistency, that is, benign overfitting, under certain assumptions on the covariance matrix. Specifically, our results show that the error of this estimator converges to zero when the covariance matrices of the correlated noise and the instrumental variables satisfy a condition on their eigenvalues. We consider several extensions relaxing these conditions and conduct experiments to support our theoretical findings. As a technical contribution, we utilize the convex Gaussian minimax theorem (CGMT) in our dual problem and extend CGMT itself.
Statistics Theory
### What problem does this paper attempt to solve?

This paper studies overfitting in high-dimensional linear regression models when the noise is correlated with the covariates. Specifically, the authors focus on estimating parameters using the instrumental variable framework in high-dimensional non-sparse parameter settings and demonstrate that an unregularized (i.e., non-ridge type) estimator can achieve consistent estimation under certain conditions.

#### Main Issues:

1. **Correlation between noise and covariates**: The paper explores how to perform effective estimation in high-dimensional linear regression when the noise is correlated with the covariates. This correlation is commonly referred to as endogeneity, which is a significant challenge in causal inference and econometrics.
2. **Non-sparse parameters**: The paper assumes that the true parameters are not sparse, meaning that most coordinates are non-zero. In this context, it investigates how to control endogeneity and high dimensionality simultaneously.

#### Research Contributions:

1. **Theoretical Analysis**: The paper demonstrates that under certain conditions on the covariance matrix, an unregularized estimator can achieve consistent estimation, i.e., benign overfitting.
2. **Condition Analysis**: The paper provides sufficient conditions on the data distribution that ensure the estimation error converges to zero. These conditions involve the effective rank of the covariance matrix of the instrumental variables.
3. **Technical Contributions**: The paper utilizes the Convex Gaussian Minimax Theorem (CGMT) to prove its results and extends the CGMT itself to accommodate the case of non-orthogonal covariance matrices.

Through this research, the paper shows that in high-dimensional and non-sparse parameter settings, using the instrumental variable framework can effectively address the correlation between noise and covariates and achieve consistent estimation.
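
To make the terms "unregularized estimator" and "endogeneity" concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's estimator: the summary above does not give its form, and no instrumental variables are used here. The sketch assumes the standard minimum-norm least-squares interpolator as a stand-in for "an estimator without regularization", and builds endogenous noise by letting the noise depend on one covariate. The sample sizes, the coefficient `0.5`, and the helper name `min_norm_interpolator` are all illustrative assumptions.

```python
import numpy as np


def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X @ theta = y (no ridge penalty).

    Hypothetical helper for illustration; not taken from the paper.
    """
    return np.linalg.pinv(X) @ y


rng = np.random.default_rng(0)
n, p = 50, 200  # more parameters than samples (assumed sizes)

X = rng.standard_normal((n, p))              # Gaussian covariates
theta = rng.standard_normal(p) / np.sqrt(p)  # non-sparse true parameter

# Endogenous noise: part of the noise is driven by the first covariate,
# so the noise and the covariates are correlated (assumed construction).
eps = 0.5 * X[:, 0] + 0.1 * rng.standard_normal(n)
y = X @ theta + eps

theta_hat = min_norm_interpolator(X, y)

# With p > n the unregularized fit reproduces the training responses,
# i.e. it overfits the observed data, noise included.
train_residual = np.max(np.abs(X @ theta_hat - y))

# Sample correlation between the noise and the covariate that drives it.
noise_covariate_corr = np.corrcoef(X[:, 0], eps)[0, 1]

print("max training residual:", train_residual)
print("corr(noise, first covariate):", noise_covariate_corr)
```

The sketch only shows that such a fit interpolates the training data while the noise is visibly correlated with a covariate. Whether the resulting estimation error vanishes is exactly what the paper's eigenvalue conditions address, and this toy setup does not check them.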