Insha Ullah, A.H. Welsh
Abstract: In this study, we explore the effects of including noise predictors and noise observations when fitting linear regression models. We present empirical and theoretical results that show that double descent occurs in both cases, albeit with contradictory implications: the implication for noise predictors is that complex models are often better than simple ones, while the implication for noise observations is that simple models are often better than complex ones. We resolve this contradiction by showing that it is not the model complexity but rather the implicit shrinkage by the inclusion of noise in the model that drives the double descent. Specifically, we show how noise predictors or observations shrink the estimators of the regression coefficients and make the test error asymptote, and then how the asymptotes of the test error and the "condition number anomaly" ensure that double descent occurs. We also show that including noise observations in the model makes the (usually unbiased) ordinary least squares estimator biased and indicates that the ridge regression estimator may need a negative ridge parameter to avoid over-shrinkage.
### Problems the Paper Attempts to Solve
This paper examines how adding noise predictors and noise observations affects the performance of linear regression models. The authors show, both empirically and theoretically, that double descent occurs in both cases, yet the two cases carry contradictory implications:
1. **Noise Predictors**: Complex models are generally better than simple models.
2. **Noise Observations**: Simple models are generally better than complex models.
The authors resolve this contradiction by showing that double descent is driven not by model complexity itself but by the implicit shrinkage that results from including noise in the model. Specifically, noise predictors or noise observations shrink the estimated regression coefficients, so the test error approaches an asymptote. The authors also show that noise observations make the usually unbiased ordinary least squares estimator biased, and that the ridge regression estimator may then require a negative ridge parameter to avoid over-shrinkage.
### Main Research Findings
1. **Shrinkage Effect**:
- Adding noise predictors or noise observations shrinks the estimator toward zero, and as \(d \to \infty\) or \(n \to \infty\), the test error converges to the same limiting value.
- This result is central to understanding double descent: the asymptotes of the test error, combined with the "condition number anomaly," ensure that double descent occurs.
2. **Ridge Regression Estimator**:
- For both training sequences (I and II), the estimator is approximately a ridge regression estimator.
- For sequence II in particular, the ridge regression estimator applied to the augmented data is approximately a "double shrinkage" ridge regression estimator. As a result, when \(n \to \infty\), the optimal ridge parameter \( \lambda \) may be negative, even if the predictors are independent and low-dimensional.
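The ridge connection for noise predictors can be illustrated with a short calculation. This is a sketch under simplifying assumptions that are not taken from the paper: \(d_0\) is taken to be the number of true predictors, collected in the \(n \times d_0\) matrix \(X\); the \(d - d_0\) noise predictors form an \(n \times (d - d_0)\) matrix \(Z\) with independent entries of mean zero and variance \(\sigma_z^2\); and \(d > n\), so the fit is the minimum-norm interpolator. When \(d - d_0\) is much larger than \(n\), \(ZZ^\top \approx (d - d_0)\sigma_z^2 I_n\), and the coefficients on the true predictors satisfy

\[
\hat\beta_X = X^\top \left(XX^\top + ZZ^\top\right)^{-1} y
\;\approx\; X^\top \left(XX^\top + \lambda I_n\right)^{-1} y
\;=\; \left(X^\top X + \lambda I_{d_0}\right)^{-1} X^\top y,
\qquad \lambda = (d - d_0)\,\sigma_z^2 .
\]

Under these assumptions the effective ridge parameter \(\lambda\) grows linearly in \(d\), so the coefficients on the true predictors shrink toward zero as \(d \to \infty\), which matches the shrinkage effect described above.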
### Experimental Results
1. **Increasing Noise Predictors**:
- As \(d\) increases, the test error exhibits a second descent, but the second descent does not always contain the best model.
- When \(d_0\) is close to the interpolation point \(d = n\), the minimum test error usually appears after the second descent.
- When \(d_0 > n\), the global minimum test error occurs in the second descent, in the over-parameterized region.
2. **Increasing Noise Observations**:
- As \(n\) increases, the behavior of the training and test errors indicates that noise observations lead to over-shrinkage.
- In the over-parameterized region, an appropriate amount of noise can achieve the global minimum test error.
- When the signal-to-noise ratio is low, more noise is needed to achieve the minimum test error.
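The noise-predictor experiment can be mimicked with a small simulation. The sketch below is not the authors' experimental setup: the sample size, the unit coefficients, the standard normal predictors, and the function name `expected_test_error` are all assumptions chosen for illustration. It fits least squares (the minimum-norm solution once \(d > n\)) and reports the test error averaged over repeated training sets.

```python
import numpy as np

def expected_test_error(n=40, d0=5, d_values=(5, 25, 40, 80, 2000), reps=50, seed=0):
    """Average test error of least squares when d - d0 noise predictors are added.

    Assumed setup (illustrative only): predictors are i.i.d. N(0, 1), the d0 true
    coefficients are all 1, and the response noise has unit variance.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(d0)  # hypothetical true coefficients
    errors = {}
    for d in d_values:
        # True coefficient vector padded with zeros for the noise predictors
        beta_full = np.concatenate([beta, np.zeros(d - d0)])
        total = 0.0
        for _ in range(reps):
            A = rng.standard_normal((n, d))  # only the first d0 columns carry signal
            y = A[:, :d0] @ beta + rng.standard_normal(n)
            # Pseudoinverse gives OLS when d <= n and the minimum-norm fit when d > n
            coef = np.linalg.pinv(A) @ y
            # For N(0, I) test predictors and unit noise variance, the expected
            # squared prediction error is 1 + ||coef - beta_full||^2
            total += 1.0 + np.sum((coef - beta_full) ** 2)
        errors[d] = total / reps
    return errors

if __name__ == "__main__":
    for d, err in expected_test_error().items():
        print(f"d = {d:5d}  test error = {err:.2f}")
```

Under these assumptions one would expect the error to peak near the interpolation point \(d = n = 40\), fall again for larger \(d\), and level off near the error of the all-zero estimator, while the smallest error in this particular configuration is still attained by the model without noise predictors.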
### Theoretical Results
1. **Estimator**:
- The authors construct two sequences of training data, \(D(d)\) and \(D(n)\), and fit a linear regression model to each.
- They prove that adding noise predictors or noise observations shrinks the estimator toward zero as \(d\) or \(n\) increases.
2. **Test Error**:
- The performance of the fitted model is evaluated through the test error. Both the empirical and the theoretical analyses show that noise predictors or noise observations shrink the estimator, so that the test error ultimately levels off at an asymptote.
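The shrinkage caused by noise observations can be checked numerically in the same spirit. The following is a minimal sketch under assumptions that are not taken from the paper: a "noise observation" is modeled as an extra row whose predictors and response are independent draws unrelated to the signal, and the function name `ols_with_noise_observations` and all parameter values are hypothetical. It compares OLS on the original data, OLS on the augmented data, and a ridge fit on the original data with ridge parameter \(m\sigma_z^2\), where \(m\) is the number of noise rows.

```python
import numpy as np

def ols_with_noise_observations(n=200, d=5, m=2000, sigma_z=1.0, seed=0):
    """Compare OLS, OLS on noise-augmented data, and a matching ridge fit.

    Assumed setup (illustrative only): N(0, 1) predictors, unit coefficients,
    unit-variance response noise, and m pure-noise rows appended to the data.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(d)  # hypothetical true coefficients
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)

    # Noise observations: predictors and responses carry no signal at all
    Z = sigma_z * rng.standard_normal((m, d))
    e = rng.standard_normal(m)
    X_aug = np.vstack([X, Z])
    y_aug = np.concatenate([y, e])

    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

    # Ridge on the original data with lambda = m * sigma_z^2, since
    # Z'Z is close to m * sigma_z^2 * I when m is large
    lam = m * sigma_z ** 2
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return beta_ols, beta_aug, beta_ridge

if __name__ == "__main__":
    b_ols, b_aug, b_ridge = ols_with_noise_observations()
    print("||OLS||           =", np.linalg.norm(b_ols))
    print("||OLS, augmented|| =", np.linalg.norm(b_aug))
    print("||ridge||         =", np.linalg.norm(b_ridge))
```

With these settings the augmented-data estimate should be much smaller in norm than the plain OLS estimate and close to the ridge estimate, which is the sense in which noise observations act as implicit shrinkage and bias the OLS estimator toward zero.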
### Conclusion
Through empirical and theoretical analysis, this paper reveals the important role of noise in fitting linear regression models. The double descent phenomenon is not directly driven by over-parameterization but by the shrinkage effect caused by noise. Therefore, choosing the correct shrinkage method is more important than precisely fitting sparse or over-parameterized models.