Abstract: When stochastic gradient descent (SGD) is used to solve large-scale machine learning problems, especially deep learning problems, a common data processing practice is to shuffle the training data, partition it across multiple threads or machines if needed, and then perform several epochs of training on data that is re-shuffled either locally or globally. Under this procedure, the instances used to compute the gradients are no longer sampled independently from the training set, which contradicts the basic assumptions of conventional convergence analysis of SGD. Does distributed SGD still have desirable convergence properties in this practical situation? In this paper, we answer this question. First, we give a mathematical formulation of the practical data processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling both have convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, because some information is lost when there is no communication between the data partitions. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not influence the convergence rate. Finally, we give the convergence analysis for the convex case. An interesting finding is that non-convex tasks such as deep learning are better suited to shuffling than convex tasks. Our theoretical results provide important insights into large-scale machine learning, especially into the selection of data processing methods that achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.
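The two data processing schemes named in the abstract can be illustrated with a minimal sketch. This is not the paper's formal definition: the function names `global_shuffle_epochs` and `local_shuffle_epochs`, the strided partitioning, and the use of a single seeded generator are all assumptions made for illustration. The sketch only shows the structural difference: under global shuffling the whole index set is re-permuted and re-partitioned every epoch, whereas under local shuffling the partition is fixed after one initial shuffle and each worker only re-permutes its own shard.

```python
import random


def global_shuffle_epochs(n, workers, epochs, seed=0):
    """Per-epoch worker shards under global shuffling (illustrative sketch)."""
    rng = random.Random(seed)
    indices = list(range(n))
    schedule = []
    for _ in range(epochs):
        # Re-permute the whole data set, then re-partition it across workers.
        rng.shuffle(indices)
        shards = [indices[w::workers] for w in range(workers)]  # assumed strided split
        schedule.append(shards)
    return schedule


def local_shuffle_epochs(n, workers, epochs, seed=0):
    """Per-epoch worker shards under local shuffling (illustrative sketch)."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)  # one initial shuffle before partitioning
    shards = [indices[w::workers] for w in range(workers)]  # partition stays fixed
    schedule = []
    for _ in range(epochs):
        # Each worker re-permutes only its own shard; no data moves between workers.
        for shard in shards:
            rng.shuffle(shard)
        schedule.append([list(shard) for shard in shards])
    return schedule


if __name__ == "__main__":
    for name, fn in [("global", global_shuffle_epochs), ("local", local_shuffle_epochs)]:
        print(name)
        for epoch, shards in enumerate(fn(n=8, workers=2, epochs=3, seed=1)):
            print("  epoch", epoch, shards)
```

In both schemes every index is visited exactly once per epoch. Under local shuffling, however, the set of indices seen by a given worker never changes, which matches the abstract's remark that there is no communication between the partitioned data in that setting.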

Variance-reduced Reshuffling Gradient Descent for Nonconvex Optimization: Centralized and Distributed Algorithms

Gradient tracking and variance reduction for decentralized optimization and machine learning

Shuffling Gradient Descent-Ascent with Variance Reduction for Nonconvex-Strongly Concave Smooth Minimax Problems

Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization

Convergence in High Probability of Distributed Stochastic Gradient Descent Algorithms

A Communication-Efficient Stochastic Gradient Descent Algorithm for Distributed Nonconvex Optimization

Adaptive Variance Reducing for Stochastic Gradient Descent

Optimal Accelerated Variance Reduced EXTRA and DIGing for Strongly Convex and Smooth Decentralized Optimization

Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

Variance-Reduced Gradient Estimator for Nonconvex Zeroth-Order Distributed Optimization

Variance-Reduced Proximal Stochastic Gradient Descent for Non-convex Composite optimization

A Zeroth-Order Variance-Reduced Method for Decentralized Stochastic Non-convex Optimization

Asynchronous Decentralized Accelerated Stochastic Gradient Descent

Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems

Parallel Asynchronous Stochastic Variance Reduction for Nonconvex Optimization

Can Decentralized Stochastic Minimax Optimization Algorithms Converge Linearly for Finite-Sum Nonconvex-Nonconcave Problems?

Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

Stochastic Nested Variance Reduction for Nonconvex Optimization

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

Variance-Reduced Stochastic Quasi-Newton Methods for Decentralized Learning: Part II