Abstract: Large-batch distributed synchronous stochastic gradient descent (SGD) is widely used to train deep neural networks on distributed-memory systems with multiple nodes, where it leverages parallel resources to reduce the number of iterative steps and speed up the convergence of the training process. However, large-batch SGD leads to poor test accuracy, which counteracts the benefits of large-scale parallel SGD. Existing solutions for large-batch training either significantly degrade accuracy or require massive additional hyper-parameter tuning. To overcome this difficulty, we propose a novel variable batch size strategy. Through an in-depth analysis of the different stages in the recent multi-step schedule, we find that the training process in the first stage is sensitive to the batch size, while different batch sizes do not significantly affect the later stages. Based on this finding, we argue that different stages of training should use different batch sizes, and we propose the variable batch size strategy for large-scale distributed training. Furthermore, to tune existing hyper-parameters automatically, we design an auto-tuning engine for the variable batch size strategy that achieves higher test accuracy in extremely large batch size cases. Using our strategy, we scale the batch size in the later stages to 120K on ImageNet-1K with ResNet-50 without accuracy loss, and to 128K with slight accuracy loss. An experimental evaluation on 2048 GPUs shows that the variable batch size strategy with our auto-tuning engine can complete the training of ResNet-50 in 25 minutes. In addition, the new strategy decreases the number of parameter updates by about 1.7 times compared with Facebook's multi-step schedule.
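As a rough illustration of the idea that different training stages can use different batch sizes, the sketch below models a stage-wise batch size schedule and counts the resulting parameter updates. It is a minimal sketch, not the authors' implementation: the names `Stage`, `batch_size_for_epoch`, and `count_updates` are hypothetical, and the stage boundaries and batch sizes in the usage example are made-up values chosen only to show the mechanics, not settings reported in the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of a multi-step schedule (hypothetical helper type)."""
    end_epoch: int   # exclusive upper bound of the stage, in epochs
    batch_size: int  # global batch size used throughout this stage


def batch_size_for_epoch(stages, epoch):
    """Return the batch size of the stage that contains `epoch`."""
    for stage in stages:
        if epoch < stage.end_epoch:
            return stage.batch_size
    raise ValueError(f"epoch {epoch} is beyond the last stage")


def count_updates(stages, dataset_size):
    """Count parameter updates over all stages.

    Each epoch takes ceil(dataset_size / batch_size) synchronous steps,
    so a larger batch size in a stage means fewer updates in that stage.
    """
    updates = 0
    start_epoch = 0
    for stage in stages:
        steps_per_epoch = math.ceil(dataset_size / stage.batch_size)
        updates += (stage.end_epoch - start_epoch) * steps_per_epoch
        start_epoch = stage.end_epoch
    return updates


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the paper).
    dataset_size = 1_000_000
    fixed = [Stage(end_epoch=90, batch_size=8_192)]
    variable = [
        Stage(end_epoch=30, batch_size=8_192),   # batch-size-sensitive first stage
        Stage(end_epoch=90, batch_size=32_768),  # later stages tolerate a larger batch
    ]
    ratio = count_updates(fixed, dataset_size) / count_updates(variable, dataset_size)
    print(f"update reduction factor: {ratio:.2f}")
```

Keeping the first stage at a smaller batch size and enlarging it only afterwards reflects the observation that the first stage is the one sensitive to batch size; the actual reduction factor depends entirely on the chosen stages.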

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

ImageNet Training in Minutes

Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour

A Variable Batch Size Strategy for Large Scale Distributed DNN Training

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

Large Batch Training of Convolutional Networks

Concurrent Adversarial Learning for Large-Batch Training

Fast and accurate variable batch size convolution neural network training on large scale distributed systems

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Pipelined Backpropagation at Scale: Training Large Models without Batches

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

A Multigrid Method for Efficiently Training Video Models

FastHebb: Scaling Hebbian Training of Deep Neural Networks to ImageNet Level

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

RRR-Net: Reusing, Reducing, and Recycling a Deep Backbone Network

Large Batch Optimization for Object Detection: Training COCO in 12 minutes

Small-GAN: Speeding Up GAN Training Using Core-sets