Abstract: Training a large-scale deep neural network on a large-scale dataset is challenging and time-consuming. Recent breakthroughs in large-batch optimization offer a promising way to tackle this challenge. However, although current advanced algorithms such as LARS and LAMB succeed on classification models, the complicated pipelines of dense visual prediction tasks such as object detection and segmentation still suffer a heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch sizes and offers several benefits over prior art. First, AGVM aligns the gradient variances between different modules in a dense visual predictor, such as the backbone, the feature pyramid network (FPN), and the detection and segmentation heads. We show that training with a large batch size can fail when the gradient variances are misaligned among these modules, a phenomenon largely overlooked in previous work. Second, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Third, we provide a theoretical analysis of AGVM. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM also enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x while achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.
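To make the idea of aligning gradient variances across modules more concrete, the following is a minimal illustrative sketch, not the released AGVM implementation. It assumes that each module's gradient variance is estimated from a few micro-batch gradients and that the backbone serves as the reference module. The function names `gradient_variance` and `alignment_factors`, the toy gradient values, and the plain variance ratio are all hypothetical choices made for illustration; the abstract does not specify the exact modulation rule.

```python
from statistics import fmean, pvariance


def gradient_variance(micro_batch_grads):
    """Mean per-coordinate variance across micro-batch gradient vectors.

    micro_batch_grads: list of equal-length gradient vectors, one per
    micro-batch (an assumed way of estimating the variance).
    """
    coords = zip(*micro_batch_grads)  # group values by parameter coordinate
    return fmean(pvariance(c) for c in coords)


def alignment_factors(grads_by_module, anchor="backbone", eps=1e-12):
    """Ratio anchor_variance / module_variance for every module.

    A factor below 1 means the module's gradients are noisier than the
    anchor's; a training loop could use the factor to damp that module's
    update (hypothetical rule, chosen only for illustration).
    """
    variances = {name: gradient_variance(g) for name, g in grads_by_module.items()}
    ref = variances[anchor]
    return {name: ref / (v + eps) for name, v in variances.items()}


# Toy gradients: two micro-batches, two parameters per module.
grads = {
    "backbone": [[0.1, -0.1], [0.3, 0.1]],
    "fpn": [[0.2, -0.2], [0.6, 0.2]],
    "head": [[0.4, -0.4], [1.2, 0.4]],
}
factors = alignment_factors(grads)
for name, factor in factors.items():
    print(f"{name}: {factor:.4f}")
```

In this toy setup the head's gradients have the largest variance, so it receives the smallest factor, while the backbone's factor stays close to 1. A real training loop would likely smooth such estimates over iterations before applying them.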

Large Batch Optimization for Deep Learning Using New Complete Layer-Wise Adaptive Rate Scaling

Large Batch Training of Convolutional Networks

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

Revisiting LARS for Large Batch Training Generalization of Neural Networks

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

A Variable Batch Size Strategy for Large Scale Distributed DNN Training

AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Concurrent Adversarial Learning for Large-Batch Training

An Optimization Strategy for Deep Neural Networks Training

The large learning rate phase of deep learning: the catapult mechanism

Barzilai-Borwein-based Adaptive Learning Rate for Deep Learning

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

The Effect of Network Width on the Performance of Large-batch Training

Dynamic Batch Adaptation

Large-batch Optimization for Dense Visual Predictions

Step Out and Seek Around: On Warm-Start Training with Incremental Data

An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

Large-Scale Deep Learning Optimizations: A Comprehensive Survey