Abstract: In recent years, deep learning models have been successfully applied to large-scale data analysis tasks such as image classification, video captioning, and natural language processing. Large-scale data analysis relies on parallel computing to accelerate model training, and data parallelism has become the dominant approach because of its high throughput. Synchronous stochastic gradient descent (SGD) has become a well-recognized optimization method for ensuring model convergence, but the overhead of gradient synchronization increases linearly with the number of workers, wasting a large amount of time. Although some efficiency-first asynchronous methods have been proposed, they cannot guarantee convergence in large-scale distributed training. To solve this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the new gradient is being synchronized, so that computation and synchronization overlap. Because updating with a stale gradient affects the normal convergence of the model, we also propose a novel adaptive exponential smoothing predicted gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to ensure the normal convergence of the training process. Experiments show that our method speeds up training and achieves accuracy comparable to standard synchronous SGD. In addition, our method has better weak scalability than traditional synchronous SGD and the methods in previous related work. We apply our method to image recognition and video captioning applications on up to 12288 cores on Tianhe II with strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability using 128 nodes. We obtain 93.4% weak scaling efficiency on 64 nodes and 90.5% on 128 nodes.
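
The abstract does not give the update rule itself, so the following is only a minimal single-process sketch of the general idea, not the authors' algorithm. It assumes a toy quadratic loss, a gradient delay of exactly one step (standing in for a synchronization that overlaps with the next computation), and a hypothetical adaptive rule in which the confidence coefficient `alpha` for the history gradient is derived from the cosine similarity between the smoothed history and the gradient that has just arrived. The names `pseudo_sync_train`, `grad`, and `cosine` are illustrative.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * ||w||^2 (assumed for illustration)."""
    return list(w)


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def pseudo_sync_train(w, lr=0.1, steps=50):
    """Update with the previous step's gradient, smoothed with its history."""
    in_flight = grad(w)         # gradient whose synchronization is "in progress"
    smoothed = list(in_flight)  # exponentially smoothed gradient history
    for _ in range(steps):
        arrived = in_flight     # previous gradient: its synchronization is done
        in_flight = grad(w)     # new gradient: in a real system its all-reduce
                                # would run concurrently with the update below
        # Hypothetical adaptive rule (an assumption, not from the abstract):
        # trust the history more when it agrees with the arrived gradient.
        alpha = 0.5 * max(0.0, cosine(smoothed, arrived))
        smoothed = [alpha * s + (1.0 - alpha) * g
                    for s, g in zip(smoothed, arrived)]
        w = [wi - lr * si for wi, si in zip(w, smoothed)]
    return w


if __name__ == "__main__":
    final_w = pseudo_sync_train([1.0, -2.0], lr=0.1, steps=100)
    print(final_w)
```

In this toy setting the stale, smoothed gradient still drives the parameters toward the minimizer at zero. In a distributed implementation, the assignment to `in_flight` would correspond to launching a non-blocking gradient all-reduce.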

One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning

Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback

AdaSelection: Accelerating Deep Learning Training through Data Subsampling

Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies

Adaptive Sampling and Reconstruction for Gradient-Domain Rendering

FLOPS: Forward Learning with OPtimal Sampling

Lsh-sampling Breaks the Computation Chicken-and-egg Loop in Adaptive Stochastic Gradient Estimation

Accelerating Stochastic Gradient Descent Using Antithetic Sampling

Accelerating Machine Learning Algorithms with Adaptive Sampling

Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach

Drill the Cork of Information Bottleneck by Inputting the Most Important Data

Can we learn better with hard samples?

Accelerated Doubly Stochastic Gradient Algorithm for Large-scale Empirical Risk Minimization

OneAdapt: Fast Configuration Adaptation for Video Analytics Applications via Backpropagation

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method

Grad Queue : A probabilistic framework to reinforce sparse gradients

Towards Better Generalization of Deep Neural Networks via Non-Typicality Sampling Scheme

Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling

Provably Convergent Subgraph-wise Sampling for Fast GNN Training