Abstract: Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompasses dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
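
To make the estimator in the abstract concrete, here is a minimal NumPy sketch on a toy objective. Everything in it is an assumption for illustration and none of it comes from the paper: the objective $F(x) = \frac{1}{n}\sum_i \mathbb{E}_\eta\left[\tfrac{1}{2}\lVert x - a_i - \eta\rVert^2\right]$ with $\eta \sim \mathcal{N}(0, I)$, the names `doubly_stochastic_grad` and `empirical_variance`, and the choice of independent noise for every component.

```python
import numpy as np


def doubly_stochastic_grad(x, a, b, m, rng):
    """Doubly stochastic gradient estimate for the toy objective
    F(x) = (1/n) * sum_i E[0.5 * ||x - a_i - eta||^2], eta ~ N(0, I).

    Hypothetical toy setup: `a` holds the n component centres, one per row.
    """
    n, d = a.shape
    # First source of randomness: subsample b of the n components.
    batch = rng.choice(n, size=b, replace=False)
    # Second source: m independent Monte Carlo draws per selected component.
    eta = rng.standard_normal((b, m, d))
    # Per-draw gradient of the toy component is x - a_i - eta.
    grads = x - a[batch][:, None, :] - eta  # shape (b, m, d)
    return grads.mean(axis=(0, 1))


def empirical_variance(x, a, b, m, rng, reps=2000):
    """Trace of the empirical covariance of the estimator over `reps` draws."""
    draws = np.stack(
        [doubly_stochastic_grad(x, a, b, m, rng) for _ in range(reps)]
    )
    return draws.var(axis=0).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 3))
    x = np.zeros(3)
    # Same budget b * m = 16, split in two different ways.
    print("b=16, m=1:", empirical_variance(x, a, b=16, m=1, rng=rng))
    print("b=1, m=16:", empirical_variance(x, a, b=1, m=16, rng=rng))
```

In this toy case with independent noise, the Monte Carlo part of the variance depends only on the product $b \times m$, while the subsampling part shrinks only with $b$, so the split with the larger $b$ gives the smaller variance. This only illustrates the budget question; it does not reproduce the paper's analysis of dependent estimators.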

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "Demystifying SGD with Doubly Stochastic Gradients" addresses optimization problems of growing importance in machine learning: objectives in the form of a sum of intractable expectations, also known as "finite sum with infinite data". Examples include the training of diffusion models and variational autoencoders. For these problems, a popular strategy is stochastic gradient descent with doubly stochastic gradients (doubly SGD). In this method, the expectation in each component is estimated by that component's gradient estimator, and the sum is estimated by subsampling these estimators. Despite this popularity, the convergence properties of doubly SGD have received little study, especially under general conditions that allow dependence between component gradient estimators.

### Main contributions

1. **Theoretical analysis**:
   - **Theorem 1**: A general variance bound for the doubly stochastic estimator is established, in the form of:
     \[ \text{tr}(\mathbb{E}[\hat{\theta}]) \leq \frac{1}{B} \sum_{i = 1}^B \sigma_i^2 \left( \frac{1}{\alpha m}+\rho \right)+\frac{\sigma_C^2}{m} \]
     where $\sigma_i^2$ is the variance of the $i$-th component estimator, $\rho\in[0, 1]$ is the correlation between estimators, and $\sigma_C^2$ is the variance of subsampling.
   - **Theorems 2 and 3**: Using the general variance bound, it is proved that when the component gradient estimators satisfy the expected residual (ER) condition and the bounded variance (BV) condition, the doubly stochastic estimator built from them, including the case of correlated estimators, also satisfies these conditions. This ensures the convergence of doubly SGD on convex, quasi-convex, and non-convex smooth objective functions.
   - **Theorem 5**: Under similar assumptions, it is proved that doubly SGD with random reshuffling (doubly SGD-RR) converges on strongly convex objective functions.
2. **Practical insights**:
   - **Budget allocation**: When using dependent gradient estimators, increasing the number of Monte Carlo samples $m$ and increasing the minibatch size $b$ have different effects on the gradient variance. The analysis of Lemma 9 reveals that reducing the subsampling variance can also reduce the Monte Carlo variance. Therefore, under a fixed budget $m\times b$, increasing $b$ should be prioritized over increasing $m$.
   - **Advantages of random reshuffling**: The analysis shows that for strongly convex objective functions, random reshuffling can improve the iteration complexity of doubly SGD. Specifically, it is improved from $\mathcal{O}\left(\frac{1}{\epsilon^2}+\frac{1}{\epsilon}\right)$ to $\mathcal{O}\left(\frac{1}{\epsilon^2}+\frac{1}{\sqrt{\epsilon}}\right)$. In addition, for dependent gradient estimators, doubly SGD-RR is "super-efficient": for batches that require $\Theta(m\times b)$ samples to compute, it has a tighter asymptotic sample complexity than full-batch SGD.

### Conclusion

The paper provides a theoretical basis for understanding the performance of doubly SGD on "finite sum with infinite data" problems by establishing a general variance bound for the doubly stochastic gradient estimator and analyzing its convergence properties under different conditions. At the same time, the paper...
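
The random reshuffling variant mentioned above differs from independent minibatching only in how the component indices are drawn. The following is a minimal NumPy sketch of that difference, again on an assumed toy objective $F(x) = \frac{1}{n}\sum_i \mathbb{E}_\eta\left[\tfrac{1}{2}\lVert x - a_i - \eta\rVert^2\right]$ with $\eta \sim \mathcal{N}(0, I)$. The function names, step size, and epoch count are hypothetical choices for illustration, not the paper's algorithm statement or settings.

```python
import numpy as np


def reshuffled_batches(n, b, epochs, rng):
    """Random reshuffling: draw one permutation per epoch, then sweep it
    in disjoint minibatches so every index is visited once per epoch."""
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, b):
            yield perm[start:start + b]


def doubly_sgd_rr(a, b, m, epochs=200, lr=0.05, seed=0):
    """Doubly SGD with random reshuffling on the toy objective
    F(x) = (1/n) * sum_i E[0.5 * ||x - a_i - eta||^2], eta ~ N(0, I).

    Hypothetical toy setup; the minimizer of F is the row mean of `a`.
    """
    rng = np.random.default_rng(seed)
    n, d = a.shape
    x = np.zeros(d)
    for batch in reshuffled_batches(n, b, epochs, rng):
        # m fresh Monte Carlo draws for each component in the minibatch.
        eta = rng.standard_normal((len(batch), m, d))
        grad = (x - a[batch][:, None, :] - eta).mean(axis=(0, 1))
        x = x - lr * grad
    return x


if __name__ == "__main__":
    a = np.random.default_rng(1).standard_normal((32, 3))
    x_hat = doubly_sgd_rr(a, b=4, m=4)
    print("distance to minimizer:", np.linalg.norm(x_hat - a.mean(axis=0)))
```

With a constant step size the iterate settles in a noisy neighbourhood of the minimizer rather than converging exactly. The sketch shows the sampling scheme only and makes no attempt to verify the complexity rates quoted above.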
