Stable and low-precision training for large-scale vision-language models

Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt
2023-10-17
Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two key problems in the training of large-scale vision-language models: **accelerating training** and **stabilizing training**.

1. **Accelerating training**:
   - **Problem background**: Training large models on large datasets consumes enormous compute, and training speed becomes a bottleneck. Low-precision training (int8 or float8) can reduce that cost, but the challenge is to gain speed without losing accuracy.
   - **Specific objective**: Accelerate the training of large-scale vision-language models by introducing SwitchBack, a linear layer that trains with int8 precision while matching the performance of bfloat16 training.
2. **Stabilizing training**:
   - **Problem background**: In large-scale training, sudden jumps in the loss (loss spikes) degrade performance and harm both training stability and the final quality of the model.
   - **Specific objective**: Analyze the causes of loss spikes and propose a hybrid AdamW-Adafactor optimizer that reduces their occurrence, thereby improving training stability.

### Main contributions

1. **Accelerating training**:
   - **SwitchBack**: A new linear layer that uses 8-bit precision (int8) for the forward pass and the input-gradient computation, but switches back to higher precision (16-bit) for the weight-gradient computation. On CLIP ViT-Huge, SwitchBack trains 13-25% faster than the standard PyTorch layer, and its zero-shot ImageNet accuracy is within 0.1 percentage points of bfloat16 training.
   - **8-bit floating-point (fp8) training**: Training of large-scale models at fp8 precision is studied by simulating fp8 computation. Tensor-wise quantization is found to be prone to divergence on large models, but adding layer-scale initialized to zero keeps feature magnitudes under control, so that fp8 training with standard techniques remains accurate.
2. **Stabilizing training**:
   - **Loss spike analysis**: Loss spikes in CLIP training are analyzed and found to occur consistently 1-8 iterations after the AdamW second-moment estimator underestimates the squared gradients.
   - **StableAdamW**: A hybrid AdamW-Adafactor optimizer, called StableAdamW, is proposed. It tracks the ratio of the squared gradient to the second-moment estimate and reduces the learning rate when that ratio is large, which avoids loss spikes and improves training stability.

### Experimental results

- **Accelerating training**:
  - **Accuracy**: SwitchBack matches the performance of bfloat16 training at both int8 and fp8 precision. On CLIP ViT-Huge, the accuracy difference is only 0.1 percentage points.
  - **Speed**: In end-to-end CLIP training on a single node with 4 A100 GPUs, the speed-up of SwitchBack over the standard PyTorch layer ranges from 5% to 35%.
- **Stabilizing training**:
  - **Loss curve**: Models trained with StableAdamW show no loss spikes during training, while models trained with the other methods show obvious loss spikes.
  - **Feature magnitude**: Initializing layer-scale to zero keeps feature magnitudes under control and avoids divergence in low-precision training.

### Summary

This paper tackles the speed and stability of large-scale vision-language model training by introducing the SwitchBack linear layer and the StableAdamW optimizer, respectively, providing practical techniques for future large-scale model training.
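
To make the SwitchBack description concrete, the following is a minimal pure-Python sketch of the precision split: int8 operands for the forward pass and the input gradient, full-precision floats for the weight gradient. The helper names, the row-wise quantization of activations, and the tensor-wise quantization of weights are assumptions made for illustration, and plain Python floats stand in for 16-bit arithmetic. It is not the paper's implementation and gives no speed-up.

```python
def quantize_rowwise(x):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    q, scales = [], []
    for row in x:
        absmax = max(abs(val) for val in row) or 1.0  # avoid division by zero
        scales.append(absmax / 127.0)
        q.append([round(val * 127.0 / absmax) for val in row])
    return q, scales

def quantize_tensorwise(w):
    """Symmetric int8 quantization with a single scale for the whole matrix."""
    absmax = max(abs(val) for row in w for val in row) or 1.0
    q = [[round(val * 127.0 / absmax) for val in row] for row in w]
    return q, absmax / 127.0

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(ai * bi for ai, bi in zip(row, col)) for col in zip(*b)]
            for row in a]

def switchback_forward(x, w):
    """y = x @ w^T using int8 operands; x is (batch x in), w is (out x in)."""
    xq, x_scales = quantize_rowwise(x)
    wq, w_scale = quantize_tensorwise(w)
    acc = matmul(xq, transpose(wq))  # integer accumulation
    return [[val * s * w_scale for val in row] for row, s in zip(acc, x_scales)]

def switchback_backward(x, w, grad_out):
    """Input gradient in int8, weight gradient in higher precision."""
    gq, g_scales = quantize_rowwise(grad_out)
    wq, w_scale = quantize_tensorwise(w)
    acc = matmul(gq, wq)  # grad_x = grad_out @ w, int8 operands
    grad_x = [[val * s * w_scale for val in row] for row, s in zip(acc, g_scales)]
    # "Switch back": grad_w = grad_out^T @ x is computed without quantization.
    grad_w = matmul(transpose(grad_out), x)
    return grad_x, grad_w

# Illustrative data only.
x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, -0.4, 0.1], [1.0, 0.5, -0.5]]
grad_out = [[1.0, 0.0], [0.0, 1.0]]

y = switchback_forward(x, w)
y_ref = matmul(x, transpose(w))  # unquantized reference
grad_x, grad_w = switchback_backward(x, w, grad_out)
```

In this toy example `y` approximates `y_ref` up to quantization error, while `grad_w` carries no quantization error at all, which is the point of computing the weight gradient at higher precision.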
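
The StableAdamW idea described above, lowering the learning rate when the squared gradient is large relative to the second-moment estimate, can be sketched for a single tensor as follows. This is a hedged illustration that assumes standard bias-corrected AdamW moments plus an Adafactor-style scaling of the learning rate by the root-mean-square of that ratio. The function name `stable_adamw_step` and the hyperparameter defaults are hypothetical, and the paper's exact algorithm may differ in detail.

```python
import math

def stable_adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9,
                      beta2=0.99, eps=1e-6, weight_decay=0.0):
    """One update for one tensor stored as flat lists; t is the 1-based step.

    Returns (new_params, new_m, new_v, lr_used).
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Standard Adam bias correction.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # RMS of g^2 / v_hat: close to 1 when the estimator is accurate,
    # much larger than 1 when it under-estimates the squared gradient.
    ratios = [g * g / max(vh, eps * eps) for g, vh in zip(grads, v_hat)]
    rms = math.sqrt(sum(ratios) / len(ratios))
    # Shrink the learning rate for this tensor only when the ratio is large.
    lr_used = lr / max(1.0, rms)
    new_params = [
        p - lr_used * weight_decay * p - lr_used * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return new_params, m, v, lr_used

# Toy run: 100 steps of small gradients, then one sudden large gradient.
params, m, v = [1.0], [0.0], [0.0]
lrs = []
for t in range(1, 102):
    g = [0.01] if t <= 100 else [1.0]
    params, m, v, lr_used = stable_adamw_step(params, g, m, v, t)
    lrs.append(lr_used)
```

In the toy run the effective learning rate stays at its nominal value while the gradients are steady and drops sharply on the step where the gradient suddenly grows, since the second-moment estimate has not yet caught up.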
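
The effect of initializing layer-scale to zero can be seen in a small sketch. It assumes the usual layer-scale form, in which a residual branch is multiplied by a learnable per-feature scale before being added back to the input. The names `layer_scale_residual` and `large_output_block` are hypothetical stand-ins, not code from the paper.

```python
def layer_scale_residual(x, block, gamma):
    """Compute x + gamma * block(x) with one learnable scale per feature."""
    fx = block(x)
    return [xi + gi * fi for xi, gi, fi in zip(x, gamma, fx)]

def large_output_block(x):
    # Hypothetical stand-in for an attention or MLP sub-block whose
    # outputs have large magnitude.
    return [1000.0 * xi for xi in x]

x = [0.5, -1.0, 2.0]
gamma = [0.0] * len(x)  # layer-scale initialized with zeros
out = layer_scale_residual(x, large_output_block, gamma)
```

With `gamma` at zero the block is the identity at initialization, so large branch outputs cannot inflate the features until training moves `gamma` away from zero.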