Abstract: In decentralized optimization, $m$ agents form a network and communicate only with their neighbors, which offers advantages in data ownership, privacy, and scalability. Decentralized stochastic gradient descent (\texttt{SGD}) methods, which are popular for training large-scale machine learning models, have also shown their superiority over centralized counterparts. Distributed stochastic gradient tracking~(\texttt{DSGT})~\citep{pu2021distributed} is widely regarded as a popular, state-of-the-art decentralized \texttt{SGD} method owing to its theoretical guarantees. However, the theoretical analysis of \texttt{DSGT}~\citep{koloskova2021improved} shows that its iteration complexity is $\tilde{\mathcal{O}} \left(\frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu(1 - \lambda_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where $W$ is a doubly stochastic mixing matrix that represents the network topology and $C_W$ is a parameter that depends on $W$. This indicates that the convergence of \texttt{DSGT} is heavily affected by the topology of the communication network. To overcome this weakness, we adopt a snap-shot gradient tracking technique and propose two novel algorithms. We further show that the two proposed algorithms are more robust to the topology of the communication network, while keeping an algorithmic structure similar to that of \texttt{DSGT} and using the same communication strategy. Their iteration complexities are $\mathcal{O}\left( \frac{\bar{\sigma}^2}{m\mu\varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu (1 - \lambda_2(W))\sqrt{\varepsilon}} \right)$ and $\mathcal{O}\left( \frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu (1 - \lambda_2(W))^{1/2}\sqrt{\varepsilon}} \right)$, respectively, which reduce the impact of the network topology (no dependence on $C_W$).
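For orientation, the baseline \texttt{DSGT} recursion that the abstract compares against can be sketched as follows. This is a minimal illustration of the commonly stated gradient tracking update, not the proposed snap-shot algorithms, whose details the abstract does not give. The scalar quadratic objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$, the ring network with uniform weights $1/3$, the step size, and all function names are assumptions chosen for illustration.

```python
import random


def ring_mixing_matrix(m):
    """Doubly stochastic W for a ring of m >= 3 agents (assumed topology):
    weight 1/3 on the agent itself and on each of its two neighbors."""
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i][j % m] = 1.0 / 3.0
    return W


def mix(W, v):
    """One communication round: every agent averages its neighbors' values."""
    m = len(v)
    return [sum(W[i][j] * v[j] for j in range(m)) for i in range(m)]


def dsgt(b, W, step, iters, noise_std=0.0, seed=0):
    """Gradient tracking on the hypothetical local objectives
    f_i(x) = 0.5 * (x - b_i)^2 with additive Gaussian gradient noise."""
    rng = random.Random(seed)
    m = len(b)

    def stochastic_grads(x):
        return [x[i] - b[i] + rng.gauss(0.0, noise_std) for i in range(m)]

    x = [0.0] * m
    g = stochastic_grads(x)
    y = list(g)  # tracker starts at the local stochastic gradients
    for _ in range(iters):
        # x_{k+1} = W (x_k - step * y_k)
        x = mix(W, [x[i] - step * y[i] for i in range(m)])
        # y_{k+1} = W y_k + g_{k+1} - g_k
        g_new = stochastic_grads(x)
        Wy = mix(W, y)
        y = [Wy[i] + g_new[i] - g[i] for i in range(m)]
        g = g_new
    return x, y, g


if __name__ == "__main__":
    b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    x, y, g = dsgt(b, ring_mixing_matrix(len(b)), step=0.1, iters=500)
    print(x)  # each agent's iterate; the minimizer of the average objective is mean(b)
```

Because $W$ is doubly stochastic, the average of the trackers $y_i$ always equals the average of the most recent local stochastic gradients, which is the property gradient tracking relies on. How quickly the agents reach consensus depends on $W$, which is where the topology-dependent terms in the complexity bounds enter.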

Variance-Reduced Decentralized Stochastic Optimization with Gradient Tracking -- Part II: GT-SVRG

A variance-reduced stochastic gradient tracking algorithm for decentralized optimization with orthogonality constraints

Larger is Better: The Effect of Learning Rates Enjoyed by Stochastic Optimization with Progressive Variance Reduction

VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning

Trading-off variance and complexity in stochastic gradient descent

Stochastic Sub-Sampled Newton Method with Variance Reduction

Gradient tracking and variance reduction for decentralized optimization and machine learning

Decentralized Stochastic Gradient Tracking for Non-convex Empirical Risk Minimization

A stochastic variance reduced gradient method with adaptive step for stochastic optimization

Parallel Asynchronous Stochastic Variance Reduction for Nonconvex Optimization

Byzantine-Robust Loopless Stochastic Variance-Reduced Gradient

Adaptive Variance Reducing for Stochastic Gradient Descent

Variance-Reduced Proximal Stochastic Gradient Descent for Non-convex Composite Optimization

Closing the gap between SVRG and TD-SVRG with Gradient Splitting

Decentralized Sum-of-Nonconvex Optimization

Stochastic Nested Variance Reduction for Nonconvex Optimization

Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

Variance reduction techniques for stochastic proximal point algorithms

Snap-Shot Decentralized Stochastic Gradient Tracking Methods

SVRG Meets AdaGrad: Painless Variance Reduction

Decentralized gradient tracking with local steps