On Regularization via Early Stopping for Least Squares Regression

Rishi Sonthalia, Jackie Lok, Elizaveta Rebrova
2024-06-07
Abstract: A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$, and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
Machine Learning, Optimization and Control, Statistics Theory
What problem does this paper attempt to address?
The core problem this paper addresses is understanding how early stopping affects model parameters and generalization ability in least-squares regression. Specifically, the paper focuses on three questions:

1. **Nature of early stopping**: What characterizes the model obtained by early stopping?
2. **When early stopping is suitable**: Under what circumstances is early stopping beneficial?
3. **How to determine the time to stop training**: How should the optimal stopping time be chosen?

### Main contributions of the paper

1. **Exact trajectory formula**:
   - Provides exact expressions for the parameters \(\beta_k\) along the gradient descent trajectory. These expressions hold for any data, learning rate schedule, and noise distribution.
   - Gives specific expressions for common learning rate schedules (such as constant learning rate and polynomial decay).
2. **Equivalence with generalized ridge regression**:
   - Proves that, for general data and learning rate schedules, the early-stopped solution is equivalent to the minimum-norm solution of a generalized ridge regression problem.
   - Shows that any minimum-norm ridge regression solution can also be obtained by early stopping, provided that a different learning rate can be selected in each feature space.
3. **Sufficient conditions for early stopping to be beneficial**:
   - Gives sufficient conditions under which early stopping improves generalization performance.
   - Shows that, for many common learning rate schedules, early stopping is beneficial independently of the input data distribution.
   - Also provides sufficient conditions under which early stopping is not beneficial.
4. **Optimal stopping time estimation**:
   - Provides an estimate of the optimal stopping time for general data and a large class of learning rate schedules.
   - Numerically verifies the accuracy of this estimate and generalizes earlier results that were limited to well-conditioned covariance matrices and constant step sizes.
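
The first two contributions can be illustrated on a toy problem. The sketch below runs full-batch gradient descent on the least-squares loss \(\frac{1}{2n}\|X\beta - y\|_2^2\) from \(\beta_0 = 0\) with a polynomially decaying schedule, then compares the early-stopped iterate with a ridge solution that uses a separate penalty per coordinate. It assumes a design whose sample covariance \(X^TX/n\) is diagonal, so that a standard coordinate-wise calculation applies: coordinate \(j\) of the iterate equals \((1 - p_j)\) times the least-squares value, with \(p_j = \prod_i (1 - \eta_i s_j)\), and the matching penalty is \(\mu_j = s_j p_j / (1 - p_j)\). The function names, the toy data, and the diagonal restriction are assumptions made for illustration; the paper's general statement is not reproduced here.

```python
from math import prod

def gd_path(X, y, etas, lam=0.0):
    """Full-batch GD on (1/2n)||X b - y||^2 + (lam/2)||b||^2, starting at 0.

    Returns the list of iterates [beta_0, beta_1, ..., beta_T].
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    path = [beta[:]]
    for eta in etas:
        resid = [sum(X[i][j] * beta[j] for j in range(d)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n + lam * beta[j]
                for j in range(d)]
        beta = [beta[j] - eta * grad[j] for j in range(d)]
        path.append(beta[:])
    return path

def equivalent_ridge_penalties(s, etas):
    """Per-coordinate penalties mu_j matching early stopping.

    Assumes X^T X / n = diag(s), lam = 0, beta_0 = 0 and 0 < eta * s_j < 1.
    """
    penalties = []
    for s_j in s:
        p = prod(1.0 - eta * s_j for eta in etas)  # remaining shrinkage
        penalties.append(s_j * p / (1.0 - p))
    return penalties

# Hypothetical toy design with orthogonal columns: X^T X / n = diag(0.5, 2.0)
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 2.0]]
y = [1.0, 3.0, 2.0, 5.0]
n = len(X)
s = [sum(row[j] ** 2 for row in X) / n for j in range(2)]
b = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(2)]

etas = [0.3 / k ** 0.5 for k in range(1, 11)]  # polynomial decay, T = 10

beta_T = gd_path(X, y, etas)[-1]
mu = equivalent_ridge_penalties(s, etas)
beta_ridge = [b[j] / (s[j] + mu[j]) for j in range(2)]

print("early-stopped GD :", beta_T)
print("generalized ridge:", beta_ridge)
```

The two printed vectors should agree up to floating-point error. The penalties differ across coordinates, which is why the matching problem is a generalized ridge problem and not ordinary ridge with a single penalty.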
### Key formulas in theoretical derivation

- **Gradient descent update formula**:
  \[ \beta_{k + 1}=\beta_k-\eta_{k + 1}\left(\frac{1}{n}X^T(X\beta_k - y)+\lambda\beta_k\right) \]
- **Expected excess risk formula**:
  \[ E_\epsilon[R(\beta_k)]=\left\|\Sigma^{1/2}V\Phi(k, 0)(\tilde{\beta}_0-\tilde{\beta}^*)\right\|_2^2+\tau^2\left\|\Sigma^{1/2}V(I - \Phi(k, 0))\Sigma^\dagger X\right\|_F^2 \]
- **Optimal stopping time estimate**:
  \[ \sum_{i = 1}^k\eta_i\approx\frac{\log\left(\frac{\sigma^2}{\tau^2\Lambda_{jj}}+ 1\right)}{\Lambda_{jj}} \]

### Conclusion

By analyzing the dynamics of discrete gradient descent, the paper provides a theoretical basis for early stopping: it shows that early stopping behaves similarly to \(L_2\) regularization and can serve as a regularization method that improves the generalization performance of the model. The paper also provides sufficient conditions for early stopping to be beneficial, gives estimates of the optimal stopping time, and verifies the accuracy of these theoretical results through experiments.
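
The optimal stopping time estimate is stated as a target for the cumulative step size \(\sum_{i=1}^k \eta_i\), so turning it into a step count depends on the schedule. The sketch below computes that target and returns the first \(k\) at which a given schedule reaches it. This summary does not define \(\sigma^2\), \(\tau^2\), or \(\Lambda_{jj}\), nor say which index \(j\) to use, so the code takes them as plain inputs. The function names and the numerical values are hypothetical.

```python
from math import log

def target_cumulative_step(sigma2, tau2, lam_jj):
    """Right-hand side of the estimate: log(sigma^2/(tau^2 L) + 1) / L."""
    return log(sigma2 / (tau2 * lam_jj) + 1.0) / lam_jj

def estimated_stopping_time(schedule, sigma2, tau2, lam_jj, max_steps=100_000):
    """Smallest k with eta_1 + ... + eta_k >= target, or None if not reached.

    `schedule(k)` must return the learning rate eta_k for k = 1, 2, ...
    """
    target = target_cumulative_step(sigma2, tau2, lam_jj)
    total = 0.0
    for k in range(1, max_steps + 1):
        total += schedule(k)
        if total >= target:
            return k
    return None

def constant_schedule(k):
    return 0.1  # eta_k = 0.1 for all k

def decaying_schedule(k):
    return 0.1 / k ** 0.5  # eta_k = 0.1 / sqrt(k)

# Illustrative values only; they are not taken from the paper.
params = dict(sigma2=1.0, tau2=0.5, lam_jj=2.0)
k_const = estimated_stopping_time(constant_schedule, **params)
k_decay = estimated_stopping_time(decaying_schedule, **params)
print("target sum of step sizes:", target_cumulative_step(**params))
print("constant schedule stops at k =", k_const)
print("decaying schedule stops at k =", k_decay)
```

A decaying schedule accumulates step size more slowly, so it reaches the same target at a later iteration than a constant schedule with the same initial rate.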