Abstract: In recent years, deep learning models have been successfully applied to large-scale data analysis tasks such as image classification, video captioning, and natural language processing. Large-scale analyses rely on parallel computing to accelerate model training, and data parallelism has become the dominant approach because of its high throughput. Synchronous stochastic gradient descent (SGD) is the well-recognized optimization method for ensuring model convergence, but the overhead of gradient synchronization grows linearly with the number of workers, wasting a large amount of time. Although some efficiency-first asynchronous methods have been proposed, they cannot guarantee convergence in large-scale distributed training. To solve this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the new gradient is being synchronized, so that computation and synchronization overlap. Because updating with a stale gradient affects the normal convergence of the model, we also propose a novel adaptive exponential smoothing predicted gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to ensure that training converges normally. Experiments show that our method speeds up training and achieves accuracy comparable to standard synchronous SGD. In addition, our method has better weak scalability than traditional synchronous SGD and the methods in previous related work. We apply our method to image recognition and video captioning applications on up to 12,288 cores of Tianhe II with strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability on 128 nodes, with 93.4% weak scaling efficiency on 64 nodes and 90.5% on 128 nodes.
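The pseudo-synchronous idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the abstract does not give the update rule, so the function names (`adaptive_alpha`, `pseudo_sync_sgd`), the cosine-agreement rule for the confidence coefficient, and the bounds `lo` and `hi` are all hypothetical. The sketch applies the gradient from the previous step (standing in for the one whose synchronization has already finished) while a new gradient is "in flight", and blends it with the smoothed history gradient.

```python
import math


def adaptive_alpha(history, newest, lo=0.1, hi=0.9):
    """Hypothetical confidence rule (an assumption, not from the paper):
    trust the history gradient more when it points the same way as the
    newest synchronized gradient, measured by cosine similarity."""
    dot = sum(h * g for h, g in zip(history, newest))
    norm_h = math.sqrt(sum(h * h for h in history))
    norm_g = math.sqrt(sum(g * g for g in newest))
    if norm_h == 0.0 or norm_g == 0.0:
        return lo  # no usable history yet: rely mostly on the newest gradient
    cos = max(-1.0, min(1.0, dot / (norm_h * norm_g)))
    # Map cosine in [-1, 1] linearly onto [lo, hi].
    return lo + (hi - lo) * (cos + 1.0) / 2.0


def pseudo_sync_sgd(grad_fn, w0, lr=0.1, steps=300):
    """Simulate updates that always use the previous step's gradient.

    In a real distributed run, the all-reduce of the new gradient would
    overlap with this update; here the overlap is only modeled by the
    one-step delay between computing a gradient and applying it."""
    w = list(w0)
    smoothed = [0.0] * len(w)   # exponentially smoothed history gradient
    in_flight = grad_fn(w)      # gradient whose synchronization is pending
    for _ in range(steps):
        stale = in_flight       # synchronization of the previous gradient is done
        in_flight = grad_fn(w)  # start the next gradient at the current weights
        alpha = adaptive_alpha(smoothed, stale)
        smoothed = [alpha * s + (1.0 - alpha) * g
                    for s, g in zip(smoothed, stale)]
        w = [wi - lr * si for wi, si in zip(w, smoothed)]
    return w


if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
    final_w = pseudo_sync_sgd(lambda w: list(w), [1.0, -2.0])
    print(final_w)
```

On this toy quadratic the delayed, smoothed updates still drive the weights toward zero. The sketch says nothing about behavior on real networks or about the scaling figures reported in the abstract.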

Massively scalable prototype learning for heterogeneous parallel computing architecture

Scalable Prototype Learning Using Gpus

DaDianNao: A Machine-Learning Supercomputer

Toward Large-Scale Evolutionary Multitasking: A GPU-Based Paradigm

An Efficient 2D Method for Training Super-Large Deep Learning Models

BabelTower: Learning to Auto-parallelized Program Translation

Implementation of a Massively Parallel Method of Characteristics Neutron Transport Calculation on CPUs/GPUs Heterogeneous High-Performance Computing Clusters

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

MLPs: Efficient Training of MiniGo on Large-scale Heterogeneous Computing System

Accelerating Spatiotemporal Supervised Training of Large-Scale Spiking Neural Networks on GPU

Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method

A New Method To Parallel Implementation For Batching Vast Small-Scale Computing Tasks Based On Gpu

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Large Scale Image Classification Using GPU-based Genetic Programming

Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines