Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach
DOI: https://doi.org/10.48550/arXiv.1805.10074
2018-11-23
Abstract: We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
Machine Learning, Optimization and Control, Statistics Theory
What problem does this paper attempt to address?
The paper asks whether, for least-squares regression, a single pass of stochastic gradient descent (SGD) over the data is statistically optimal in all cases. Existing theoretical analyses show that in low-dimensional, easy problems a single pass does achieve statistically optimal performance. In practice, however, multiple passes over the data usually generalize better, especially on high-dimensional, hard problems. Specifically, the paper explores the following questions:

- For which types of "hard" problems can SGD with multiple passes over the data achieve statistically optimal prediction performance, while a single pass cannot?
- For these hard problems, how does the optimal number of passes over the data change as the sample size increases?

To define and characterize these problems, the authors work with potentially infinite-dimensional models and use concepts common in kernel methods: the eigenvalue decay rate of the feature covariance matrix and the complexity of the optimal predictor. Using these tools, they prove that for certain hard problems, SGD with multiple passes over the data achieves statistically optimal performance while a single pass does not. They also show that in these hard problems, the optimal number of passes increases with the sample size.

### Formula Summary

1. **Eigenvalue Decay**:
   \[ \lambda_m = O(m^{-\alpha}), \quad \alpha \geq 1 \]
   Here, \(\lambda_m\) is the \(m\)-th eigenvalue of the feature covariance matrix \(\Sigma\), and \(\alpha\) controls the effective size of the feature space: \(\alpha = 1\) corresponds to the largest feature space, and \(\alpha = +\infty\) corresponds to a finite-dimensional space.

2. **Complexity of the Optimal Predictor**:
   \[ \langle \theta^*, \Sigma^{1 - 2r} \theta^* \rangle \text{ small} \]
   The parameter \(r \geq 0\) represents the difficulty of the learning problem. \(r = 1/2\) corresponds to measuring the complexity of the predictor by the squared norm \(\|\theta^*\|^2\); \(r\) close to zero represents the hardest problems, and large \(r\) (especially \(r \geq 1/2\)) represents easier problems.

3. **Optimal Prediction Performance**:
   \[ O\left(n^{-\frac{2r\alpha}{2r\alpha + 1}}\right) \]
   Here, \(n\) is the sample size, and \(\alpha\) and \(r\) are the eigenvalue decay parameter and the complexity parameter of the optimal predictor, respectively.

4. **Comparison between Single-pass and Multiple-pass**:
   - For easy problems (\(r \geq \frac{\alpha - 1}{2\alpha}\)), a single pass can achieve optimal performance.
   - For hard problems (\(r \leq \frac{\alpha - 1}{2\alpha}\)), multiple passes are required to achieve optimal performance, and the number of passes increases as the sample size increases.

### Conclusion

The main contribution of the paper is to show that for certain hard problems, SGD with multiple passes over the data achieves statistically optimal performance, while a single pass does not. This result helps explain the practical benefit of multiple passes and informs how SGD is run in large-scale machine learning.
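
To make the formulas in the summary concrete, the following is a minimal Python sketch that evaluates the easy/hard threshold \(\frac{\alpha - 1}{2\alpha}\) and the rate exponent \(\frac{2r\alpha}{2r\alpha + 1}\) for given values of \(r\) and \(\alpha\). The function names (`hardness_threshold`, `is_easy`, `rate_exponent`) are hypothetical helpers introduced for illustration and do not come from the paper. The sketch assumes a finite \(\alpha\), and it assigns the boundary case \(r = \frac{\alpha - 1}{2\alpha}\) to the easy regime, which is a choice made here because the summary states both inequalities as non-strict.

```python
import math


def hardness_threshold(alpha: float) -> float:
    """Return (alpha - 1) / (2 * alpha), the value of r separating the two regimes."""
    if not (math.isfinite(alpha) and alpha >= 1):
        raise ValueError("this sketch assumes a finite alpha >= 1")
    return (alpha - 1) / (2 * alpha)


def is_easy(r: float, alpha: float) -> bool:
    """True when r >= (alpha - 1) / (2 * alpha), i.e. a single pass can be optimal.

    Assumption: the boundary value is counted as easy.
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    return r >= hardness_threshold(alpha)


def rate_exponent(r: float, alpha: float) -> float:
    """Return e = 2*r*alpha / (2*r*alpha + 1), so the optimal rate is O(n^-e)."""
    if r < 0:
        raise ValueError("r must be non-negative")
    hardness_threshold(alpha)  # reuse the validation of alpha
    return (2 * r * alpha) / (2 * r * alpha + 1)


if __name__ == "__main__":
    # Illustrative parameter choices only; they are not taken from the paper.
    for r, alpha in [(0.5, 2.0), (0.25, 2.0), (0.125, 4.0)]:
        regime = "easy" if is_easy(r, alpha) else "hard"
        print(f"r={r}, alpha={alpha}: {regime}, exponent={rate_exponent(r, alpha):.3f}")
```

For example, with \(\alpha = 4\) the threshold is \(3/8\), so \(r = 1/8\) falls in the hard regime, where the summary says multiple passes are needed to reach the optimal rate.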