Abstract: Large-batch distributed synchronous stochastic gradient descent (SGD) is widely used to train deep neural networks on multi-node distributed-memory systems, where parallel resources can be leveraged to reduce the number of iterative steps and speed up the convergence of the training process. However, large-batch SGD leads to poor test accuracy, which counteracts the benefits of large-scale parallel SGD. Existing solutions for large-batch training either significantly degrade accuracy or require extensive additional hyper-parameter tuning. To overcome this difficulty, we propose a novel variable batch size strategy. Through an in-depth analysis of the different stages of the recent multi-step schedule, we find that the training process in the first stage is sensitive to the batch size, while different batch sizes do not significantly affect the later stages. Based on this discovery, we first claim that different stages of training should use different batch sizes. Hence, the variable batch size strategy is proposed for large-scale distributed training. Furthermore, in order to tune the existing hyper-parameters automatically, an auto-tuning engine is designed for the variable batch size strategy to achieve higher test accuracy in extremely large batch size cases. Using our strategy, we successfully scale the batch size in the later stages to 120K on ImageNet-1K with ResNet-50 without accuracy loss, and to 128K with slight accuracy loss. The experimental evaluation on 2048 GPUs shows that the variable batch size strategy with our auto-tuning engine can complete the training of ResNet-50 in 25 minutes. Furthermore, the new strategy reduces the number of parameter updates by a factor of about 1.7 compared with Facebook's multi-step schedule.
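
The idea of using a different batch size in each training stage, and its effect on the number of parameter updates, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `batch_size_at` and `count_updates`, the stage boundaries, and the batch sizes below are hypothetical values chosen only for illustration, and the auto-tuning engine is not modeled.

```python
import math

def batch_size_at(epoch, stages):
    """Return the batch size in effect at a 0-indexed epoch.

    `stages` is a list of (num_epochs, batch_size) pairs, one per stage.
    """
    start = 0
    for num_epochs, batch_size in stages:
        if epoch < start + num_epochs:
            return batch_size
        start += num_epochs
    raise ValueError("epoch is beyond the end of the schedule")

def count_updates(stages, dataset_size):
    """Count parameter updates: one update per mini-batch, per epoch."""
    return sum(
        num_epochs * math.ceil(dataset_size / batch_size)
        for num_epochs, batch_size in stages
    )

# Number of training images in ImageNet-1K.
DATASET_SIZE = 1_281_167

# Hypothetical schedules (assumed values, not taken from the paper):
# a fixed batch size for all epochs, versus a small batch size in the
# sensitive first stage followed by a much larger one afterwards.
fixed_schedule = [(90, 8192)]
variable_schedule = [(30, 8192), (60, 65536)]

fixed_updates = count_updates(fixed_schedule, DATASET_SIZE)
variable_updates = count_updates(variable_schedule, DATASET_SIZE)
print("fixed schedule updates:   ", fixed_updates)
print("variable schedule updates:", variable_updates)
print("batch size at epoch 45:   ", batch_size_at(45, variable_schedule))
```

Because each epoch needs roughly `dataset_size / batch_size` updates, enlarging the batch size only in the later stages cuts the total update count while leaving the batch-size-sensitive first stage unchanged.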

An optimization Strategy for Deep Neural Networks Training

An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks

An Efficient Optimization Technique for Training Deep Neural Networks

Gradient Descent Optimization in Deep Learning Model Training Based on Multistage and Method Combination Strategy

Optimization Algorithm Inspired Deep Neural Network Structure Design

Optimal Linear Decay Learning Rate Schedules and Further Refinements

An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

A Variable Batch Size Strategy for Large Scale Distributed DNN Training

Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation

Optimization for deep learning: theory and algorithms

Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima

Learning Rate Optimization for Deep Neural Networks Using Lipschitz Bandits

Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Exponential decay sine wave learning rate for fast deep neural network training

Learning by Turning: Neural Architecture Aware Optimisation

Adaptive learning rate optimization algorithms with dynamic bound based on Barzilai-Borwein method

How Does Learning Rate Decay Help Modern Neural Networks?