Abstract: We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

What problem does this paper attempt to address?

The problem this paper addresses is whether, in least-squares regression, stochastic gradient descent (SGD) with a single pass over the data is statistically optimal in all cases. Existing theoretical analyses show that for low-dimensional, simple problems, single-pass SGD does achieve statistically optimal performance. In practice, however, multiple passes over the data usually generalize better, especially on high-dimensional, complex problems.

Specifically, the paper explores the following questions:

- For which types of "difficult" problems can SGD with multiple passes over the data achieve statistically optimal prediction performance while a single pass cannot?
- For these difficult problems, how does the optimal number of passes over the data change as the sample size increases?

To define and characterize these problems, the authors work with potentially infinite-dimensional models and use concepts common in kernel methods, such as the eigenvalue decay rate of the feature covariance matrix and the complexity of the optimal predictor. Using these tools, the authors prove that for certain "difficult" problems, SGD with multiple passes over the data achieves statistically optimal performance while a single pass does not. They also show that in these difficult problems, the optimal number of passes over the data increases with the sample size.

### Formula Summary

1. **Eigenvalue Decay**: \[ \lambda_m = O(m^{-\alpha}), \quad \alpha \geq 1 \] Here, \(\lambda_m\) is the \(m\)-th eigenvalue of the feature covariance matrix \(\Sigma\), and \(\alpha\) characterizes the size of the feature space: \(\alpha = 1\) corresponds to the largest feature space, and \(\alpha = +\infty\) corresponds to a finite-dimensional space.
2. **Complexity of the Optimal Predictor**: \[ \langle \theta^*, \Sigma^{1 - 2r} \theta^* \rangle \text{ small} \] The parameter \(r \geq 0\) represents the difficulty of the learning problem: \(r = 1/2\) corresponds to measuring the complexity of the predictor by the squared norm \(\|\theta^*\|^2\), \(r\) close to zero represents the most difficult problems, and large \(r\) (especially \(r \geq 1/2\)) represents simpler problems.
3. **Optimal Prediction Performance**: \[ O(n^{-\frac{2r\alpha}{2r\alpha + 1}}) \] Here, \(n\) is the sample size, and \(\alpha\) and \(r\) are the eigenvalue decay parameter and the complexity parameter of the optimal predictor, respectively.
4. **Comparison between Single-pass and Multiple-pass**:
   - For easy problems (\(r \geq \frac{\alpha - 1}{2\alpha}\)), a single pass can achieve optimal performance.
   - For difficult problems (\(r \leq \frac{\alpha - 1}{2\alpha}\)), multiple passes are required to achieve optimal performance, and the optimal number of passes increases as the sample size increases.

### Conclusion

The main contribution of the paper is to show that in certain "difficult" problems, SGD with multiple passes over the data can achieve statistically optimal performance while a single pass cannot. This helps explain why multiple passes perform better in practice and informs how SGD is used in large-scale machine learning.
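The quantities in the formula summary can be made concrete with a minimal Python sketch. It evaluates the rate exponent \(\frac{2r\alpha}{2r\alpha + 1}\), checks the hard-problem condition \(r \leq \frac{\alpha - 1}{2\alpha}\), and runs a plain multi-pass SGD loop on the least-squares loss. The function names, the constant step size, the reshuffling at each pass, and the iterate averaging are all assumptions made for illustration; they are not taken from the paper and may differ from the exact algorithm it analyzes.

```python
import numpy as np


def rate_exponent(alpha: float, r: float) -> float:
    """Exponent e of the optimal rate O(n^-e), e = 2*r*alpha / (2*r*alpha + 1)."""
    return 2 * r * alpha / (2 * r * alpha + 1)


def is_hard(alpha: float, r: float) -> bool:
    """Hard regime as stated in the summary: r <= (alpha - 1) / (2 * alpha)."""
    return r <= (alpha - 1) / (2 * alpha)


def multipass_sgd(X, y, step, passes, seed=0):
    """Averaged SGD on the least-squares loss, cycling over the data `passes` times.

    Assumptions of this sketch: constant step size, a fresh random
    permutation of the samples at each pass, and a running average of
    all iterates as the returned estimate.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    t = 0
    for _ in range(passes):
        for i in rng.permutation(n):
            # Gradient of 0.5 * (x_i . theta - y_i)^2 with respect to theta.
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= step * grad
            t += 1
            # Running mean of the iterates seen so far.
            theta_avg += (theta - theta_avg) / t
    return theta_avg
```

For instance, `is_hard(2.0, 0.1)` falls in the hard regime because \(0.1 \leq (2 - 1)/4\), and calling `multipass_sgd` with increasing `passes` on a fixed dataset lets one observe how the estimate changes with the number of passes.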

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

Why Does Multi-Epoch Training Help?

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Accelerated SGD for Non-Strongly-Convex Least Squares

Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

Stochastic Gradient Descent in the Viewpoint of Graduated Optimization

Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models

Demystifying SGD with Doubly Stochastic Gradients

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling

Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Optimal Adaptive and Accelerated Stochastic Gradient Descent

High-dimensional scaling limits and fluctuations of online least-squares SGD with smooth covariance

Multiplicative noise and heavy tails in stochastic optimization

Accelerated stochastic approximation with state-dependent noise

Loopless Semi-Stochastic Gradient Descent with Less Hard Thresholding for Sparse Learning

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms