Abstract: A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$, and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.

What problem does this paper attempt to address?

The core problem this paper addresses is understanding how early stopping influences model parameters and generalization ability in least-squares regression. Specifically, the paper focuses on three questions:

1. **Nature of early stopping**: What are the characteristics of the model obtained by early stopping?
2. **When early stopping is suitable**: Under what circumstances is early stopping beneficial?
3. **How to determine the time to stop training**: How should the optimal stopping time be chosen?

### Main contributions of the paper

1. **Exact trajectory formula**:
   - Provide exact expressions for the parameter $\beta_k$ along the gradient descent trajectory. These expressions apply to any data, learning rate schedule, and noise distribution.
   - Give specific expressions for common learning rate schedules (such as constant learning rate, polynomial decay, etc.).
2. **Equivalence with generalized ridge regression**:
   - Prove that, for general data and learning rate schedules, early stopping is equivalent to the minimum-norm solution of a generalized ridge regression problem.
   - Show that any minimum-norm ridge regression solution can also be obtained by early stopping, provided that different learning rates can be selected in each feature space.
3. **Sufficient conditions for early stopping to be beneficial**:
   - Give sufficient conditions for early stopping to improve generalization performance.
   - Show that, for many common learning rate schedules, early stopping is beneficial independently of the input data distribution.
   - Provide sufficient conditions under which early stopping is not beneficial.
4. **Optimal stopping time estimation**:
   - Provide optimal stopping time estimates for general data and a large class of learning rate schedules.
   - Numerically verify the accuracy of this estimate and generalize previous results, which were limited to well-conditioned covariance matrices and constant step sizes.
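The equivalence with generalized ridge regression (contribution 2 above) can be checked numerically in a simple special case. The sketch below is not the paper's general construction: it assumes a constant step size, zero initialization, no explicit penalty, and a full-rank sample covariance. Under those assumptions, a short calculation in the eigenbasis of $X^TX/n$ gives a per-direction penalty $\Lambda_j c_j/(1-c_j)$ with $c_j=(1-\eta\Lambda_j)^k$. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gd_least_squares(X, y, eta, k):
    """Run k steps of full-batch gradient descent from beta_0 = 0."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(k):
        beta = beta - eta * (X.T @ (X @ beta - y) / n)
    return beta

def matching_generalized_ridge(X, y, eta, k):
    """Solve a generalized ridge problem whose penalty matches k GD steps.

    Assumes a constant step size with eta * Lambda_j < 1 and full-rank X^T X,
    so that every per-direction penalty is finite and positive.
    """
    n, d = X.shape
    S = X.T @ X / n                      # sample covariance
    lam, V = np.linalg.eigh(S)           # S = V diag(lam) V^T
    c = (1.0 - eta * lam) ** k           # remaining error fraction per direction
    penalties = lam * c / (1.0 - c)      # per-direction ridge penalty
    P = V @ np.diag(penalties) @ V.T     # generalized ridge penalty matrix
    beta = np.linalg.solve(S + P, X.T @ y / n)
    return beta, penalties

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

eta = 0.5 / np.linalg.eigvalsh(X.T @ X / 200).max()
beta_gd = gd_least_squares(X, y, eta, k=20)
beta_ridge, penalties = matching_generalized_ridge(X, y, eta, k=20)
print(np.max(np.abs(beta_gd - beta_ridge)))
```

Directions with small eigenvalues receive large penalties when training stops early, which is one way to read the claim that early stopping acts as a regularizer.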
### Key formulas in theoretical derivation

- **Gradient descent update formula**:
  \[ \beta_{k + 1}=\beta_k-\eta_{k + 1}\left(\frac{1}{n}X^T(X\beta_k - y)+\lambda\beta_k\right) \]
- **Expected excess risk formula**:
  \[ E_\epsilon[R(\beta_k)]=\left\|\Sigma^{1/2}V\Phi(k, 0)(\tilde{\beta}_0-\tilde{\beta}^*)\right\|_2^2+\tau^2\left\|\Sigma^{1/2}V(I - \Phi(k, 0))\Sigma^\dagger X\right\|_F^2 \]
- **Optimal stopping time estimate**:
  \[ \sum_{i = 1}^k\eta_i\approx\frac{\log\left(\frac{\sigma^2}{\tau^2\Lambda_{jj}}+ 1\right)}{\Lambda_{jj}} \]

### Conclusion

By analyzing the dynamics of discrete gradient descent, the paper provides a theoretical basis for early stopping, showing that early stopping is similar to $L_2$ regularization and can be used as a regularization method to improve the generalization performance of the model. The paper also provides sufficient conditions for early stopping to be beneficial, gives estimates of the optimal stopping time, and verifies the accuracy of these theoretical results through experiments.
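The gradient descent update and the stopping time estimate listed among the key formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code. The function names, the synthetic data, and the rule "stop at the first step whose cumulative learning rate reaches the estimate" are assumptions made for the example. The values passed for $\sigma^2$, $\tau^2$, and $\Lambda_{jj}$ are placeholders, since the summary does not define those symbols.

```python
import math
import numpy as np

def gradient_descent(X, y, schedule, num_steps, lam=0.0, beta0=None):
    """Iterate beta_{k+1} = beta_k - eta_{k+1} * ((1/n) X^T (X beta_k - y) + lam * beta_k).

    `schedule(k)` returns the learning rate eta_k for k = 1, 2, ...
    Returns the whole trajectory, including beta_0, as a (num_steps + 1, d) array.
    """
    n, d = X.shape
    beta = np.zeros(d) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    trajectory = [beta.copy()]
    for k in range(num_steps):
        eta = schedule(k + 1)
        grad = X.T @ (X @ beta - y) / n + lam * beta
        beta = beta - eta * grad
        trajectory.append(beta.copy())
    return np.array(trajectory)

def stopping_budget(sigma2, tau2, lam_jj):
    """Right-hand side of the stopping time estimate (inputs are placeholders)."""
    return math.log(sigma2 / (tau2 * lam_jj) + 1.0) / lam_jj

def first_step_reaching(schedule, budget, max_steps=100000):
    """Smallest k with eta_1 + ... + eta_k >= budget, or None if never reached."""
    total = 0.0
    for k in range(1, max_steps + 1):
        total += schedule(k)
        if total >= budget:
            return k
    return None

def constant(k):
    """Constant learning rate schedule (illustrative)."""
    return 0.1

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

path = gradient_descent(X, y, constant, num_steps=2000, lam=0.1)
ridge = np.linalg.solve(X.T @ X / 50 + 0.1 * np.eye(3), X.T @ y / 50)
k_stop = first_step_reaching(constant, stopping_budget(1.0, 1.0, 1.0))
print(np.max(np.abs(path[-1] - ridge)), k_stop)
```

Run to convergence with a fixed $\lambda$, the iteration approaches the ordinary ridge solution. Stopping at an earlier row of `path` gives the early-stopped estimate $\beta_T$ discussed in the abstract. For a constant learning rate $\eta$, the cumulative sum is simply $k\eta$, so the estimate reduces to a step count of roughly the budget divided by $\eta$.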

On Regularization via Early Stopping for Least Squares Regression

Early stopping and polynomial smoothing in regression with reproducing kernels

A Statistical Theory of Regularization-Based Continual Learning

Implicit Sparse Regularization: The Impact of Depth and Early Stopping

Optimal learning rates for Kernel Conjugate Gradient regression

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Early Stopping of Untrained Convolutional Neural Networks

A generalization of regularized dual averaging and its dynamics

Asymptotics of Stochastic Gradient Descent with Dropout Regularization in Linear Models

The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent.

Discrete error dynamics of mini-batch gradient descent for least squares regression

Batches Stabilize the Minimum Norm Risk in High Dimensional Overparameterized Linear Regression

Regularization properties of adversarially-trained linear regression

Uncertainty quantification for iterative algorithms in linear models with application to early stopping

The Risk of Machine Learning

High-Dimensional Linear Regression via Implicit Regularization

Provably Auditing Ordinary Least Squares in Low Dimensions

Convergence Conditions of Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data

Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model

Ridge regularization for Mean Squared Error Reduction in Regression with Weak Instruments

Learning rates for regularized least squares ranking algorithm