Kyurae Kim, Joohwan Ko, Yi-An Ma, Jacob R. Gardner
Abstract: Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompasses dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper "Demystifying SGD with Doubly Stochastic Gradients" studies a class of optimization problems of growing importance in machine learning: objectives that are a sum of intractable expectations, a setting also known as "finite sum with infinite data". Examples include the training of diffusion models and variational autoencoders.
For these problems, a popular strategy is SGD with doubly stochastic gradients (doubly SGD). In this method, each component's expectation is estimated with that component's gradient estimator, and the sum is estimated by subsampling these estimators. Despite the method's popularity, its convergence properties are poorly understood outside of strong assumptions such as bounded variance, particularly when the component gradient estimators are dependent.
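The two sources of randomness can be made concrete with a minimal sketch. It is not taken from the paper: the toy objective \(F(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\eta}\left[\tfrac{1}{2}(x - a_i - \eta)^2\right]\) with \(\eta \sim \mathcal{N}(0, 1)\), the function names, and the \(1/t\) step size are all assumptions chosen for illustration. The minimizer of this toy objective is the mean of the \(a_i\).

```python
import random


def doubly_stochastic_grad(x, a, b, m, rng):
    """Doubly stochastic gradient of the toy objective (illustrative only).

    Subsamples b of the n components without replacement, and estimates each
    selected component's expected gradient with m Monte Carlo noise draws.
    """
    batch = rng.sample(range(len(a)), b)  # subsampling over components
    total = 0.0
    for i in batch:
        # Monte Carlo estimate of E[x - a[i] - eta] for component i
        total += sum(x - a[i] - rng.gauss(0.0, 1.0) for _ in range(m)) / m
    return total / b


def doubly_sgd(a, b, m, steps, seed=0):
    """Run SGD with the estimator above and a 1/t step size (an assumption)."""
    rng = random.Random(seed)
    x = 0.0
    for t in range(1, steps + 1):
        x -= (1.0 / t) * doubly_stochastic_grad(x, a, b, m, rng)
    return x


if __name__ == "__main__":
    targets = [float(i) for i in range(10)]
    # The iterate should end up close to the mean of the targets.
    print(doubly_sgd(targets, b=2, m=2, steps=20000))
```

Here `b` controls the subsampling noise and `m` controls the Monte Carlo noise, so each iteration costs `b * m` noise draws.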
### Main contributions
1. **Theoretical analysis**:
- **Theorem 1**: A general variance bound for the doubly stochastic estimator is established, in the form of:
\[
\text{tr}(\mathbb{V}[\hat{\theta}]) \leq \frac{1}{B} \sum_{i = 1}^B \sigma_i^2 \left( \frac{1}{\alpha m}+\rho \right)+\frac{\sigma_C^2}{m}
\]
where \(\text{tr}(\mathbb{V}[\hat{\theta}])\) is the total variance of the doubly stochastic estimator \(\hat{\theta}\), \(\sigma_i^2\) is the variance of the \(i\)-th component estimator, \(\rho\in[0, 1]\) is the correlation between estimators, and \(\sigma_C^2\) is the subsampling variance.
- **Theorems 2 and 3**: Using the general variance bound, it is proved that when the component gradient estimators satisfy the expected residual (ER) condition and the bounded variance (BV) condition, the doubly stochastic estimator also satisfies these conditions, even when the component estimators are correlated. This ensures the convergence of doubly SGD on convex, quasi-convex, and non-convex smooth objective functions.
- **Theorem 5**: Under similar assumptions, it is proved that doubly SGD with random reshuffling (doubly SGD-RR) converges on strongly convex objective functions.
2. **Practical insights**:
- **Budget allocation**: When using dependent gradient estimators, increasing the number of Monte Carlo samples \(m\) and increasing the minibatch size \(b\) have different effects on the gradient variance. The analysis of Lemma 9 reveals that reducing the subsampling variance can also reduce the Monte Carlo variance. Therefore, under a fixed budget \(m\times b\), increasing \(b\) should be prioritized over increasing \(m\).
- **Advantages of random reshuffling**: The analysis shows that for strongly convex objective functions, random reshuffling improves the iteration complexity of doubly SGD from \(\mathcal{O}\left(\frac{1}{\epsilon^2}+\frac{1}{\epsilon}\right)\) to \(\mathcal{O}\left(\frac{1}{\epsilon^2}+\frac{1}{\sqrt{\epsilon}}\right)\). In addition, for dependent gradient estimators, doubly SGD-RR is "super-efficient": when each batch requires \(\Theta(m\times b)\) samples to compute, it has a tighter asymptotic sample complexity than full-batch SGD.
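The budget trade-off and the reshuffling scheme described in the list above can be explored with a small sketch. It is an illustration under assumptions, not the paper's experiment: it uses a toy component gradient \(x - a_i - \eta\) with \(\eta \sim \mathcal{N}(0, 1)\), and the names `empirical_variance`, `shared_noise`, and `reshuffled_minibatches` are hypothetical. Setting `shared_noise=True` reuses the same noise draws for every component in a minibatch, which is one simple way to make the component estimators dependent.

```python
import random
import statistics


def doubly_stochastic_grad(x, a, b, m, rng, shared_noise=False):
    """One doubly stochastic gradient of the toy objective at x."""
    batch = rng.sample(range(len(a)), b)  # subsample b components
    shared = [rng.gauss(0.0, 1.0) for _ in range(m)]
    total = 0.0
    for i in batch:
        if shared_noise:
            noise = shared  # dependent estimators: same draws for all i
        else:
            noise = [rng.gauss(0.0, 1.0) for _ in range(m)]  # independent
        total += sum(x - a[i] - eta for eta in noise) / m
    return total / b


def empirical_variance(a, b, m, trials=4000, seed=0, shared_noise=False):
    """Sample variance of the estimator at x = 0 over repeated draws."""
    rng = random.Random(seed)
    draws = [doubly_stochastic_grad(0.0, a, b, m, rng, shared_noise)
             for _ in range(trials)]
    return statistics.variance(draws)


def reshuffled_minibatches(n, b, rng):
    """Yield the minibatches of one random-reshuffling epoch.

    Every component index appears exactly once per epoch, unlike
    independent minibatching, which redraws the batch at every step.
    """
    order = list(range(n))
    rng.shuffle(order)
    for start in range(0, n, b):
        yield order[start:start + b]


if __name__ == "__main__":
    a = [float(i) for i in range(20)]
    # Same budget b * m = 16, split three different ways.
    for b, m in [(1, 16), (4, 4), (16, 1)]:
        print(b, m,
              empirical_variance(a, b, m),
              empirical_variance(a, b, m, shared_noise=True))
    print(list(reshuffled_minibatches(10, 4, random.Random(0))))
```

In this toy problem with independent noise, the Monte Carlo part of the variance depends only on the product `b * m`, while the subsampling part shrinks as `b` grows, so spending the budget on `b` gives the lower variance. How the dependent case behaves depends on the correlation and on how the component variances differ, which is what the paper's bounds quantify.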
### Conclusion
The paper provides a theoretical basis for understanding the performance of doubly SGD in dealing with "finite sum with infinite data" problems by establishing a general variance bound for the doubly stochastic gradient estimator and analyzing its convergence properties under different conditions. At the same time, the paper...