Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.
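
To make the int8 idea concrete, the following NumPy sketch simulates the numerics of a SwitchBack-style linear layer. It assumes row-wise quantization for activations and output gradients, a single tensor-wise scale for the weights, int8 matmuls for the forward pass and the input gradient, and a higher-precision matmul for the weight gradient. The function names (`quantize_rowwise`, `switchback_forward`, and so on) are hypothetical, and everything is emulated on the CPU, so this illustrates the arithmetic only, not the speed-up or the authors' kernels.

```python
import numpy as np


def quantize_rowwise(x):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    absmax = np.max(np.abs(x), axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)  # avoid division by zero
    scale = absmax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_tensorwise(w):
    """Symmetric int8 quantization with a single scale for the whole tensor."""
    absmax = float(np.max(np.abs(w)))
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def switchback_forward(x, w):
    """y = x @ w.T computed with int8 operands and int32 accumulation."""
    xq, x_scale = quantize_rowwise(x)
    wq, w_scale = quantize_tensorwise(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    # Dequantize: per-row activation scale times the single weight scale.
    return acc.astype(np.float32) * x_scale * w_scale


def switchback_backward(grad_out, x, w):
    """Input gradient in simulated int8; weight gradient in higher precision."""
    gq, g_scale = quantize_rowwise(grad_out)
    wq, w_scale = quantize_tensorwise(w)
    acc = gq.astype(np.int32) @ wq.astype(np.int32)
    grad_x = acc.astype(np.float32) * g_scale * w_scale
    # "Switch back": the weight gradient is not quantized.
    grad_w = grad_out.T @ x
    return grad_x, grad_w
```

Here `x` has shape `(batch, in_features)` and `w` has shape `(out_features, in_features)`. Comparing `switchback_forward(x, w)` against `x @ w.T` on random inputs gives a quick sense of the quantization error.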

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses two key problems in the training of large-scale vision-language models: **accelerating training** and **stabilizing training**.

1. **Accelerating training**:
   - **Problem background**: Training large models on large datasets consumes enormous compute, and training speed becomes a bottleneck. Low-precision training (int8 or float8) can be faster, but preserving model quality at low precision is an open challenge.
   - **Specific objective**: Propose methods that accelerate the training of large-scale vision-language models, in particular a linear layer named SwitchBack that trains with int8 precision while maintaining performance comparable to bfloat16 training.
2. **Stabilizing training**:
   - **Problem background**: In large-scale training, sudden jumps in the loss (loss spikes) degrade performance and harm both training stability and final model quality.
   - **Specific objective**: Analyze the causes of loss spikes and propose a hybrid optimizer (AdamW-Adafactor) that reduces their occurrence, thereby improving training stability.

### Main contributions

1. **Accelerating training**:
   - **SwitchBack**: A new linear layer that uses 8-bit precision (int8) for the forward pass and the input-gradient computation, but switches back to higher precision (16-bit) for the weight-gradient computation. On CLIP ViT-Huge, SwitchBack trains 13-25% faster than the standard PyTorch layer, with zero-shot ImageNet accuracy comparable to bfloat16 training.
   - **8-bit floating-point (fp8) training**: Training of large-scale models at fp8 precision is studied by simulating fp8 computation. Tensor-wise quantization is found to be prone to divergence on large models, but initializing layer-scale to zero keeps feature magnitudes under control and allows accurate training.
2. **Stabilizing training**:
   - **Loss spike analysis**: Loss spikes in CLIP training are analyzed and found to occur consistently 1-8 iterations after the AdamW second-moment estimator underestimates the squared gradients.
   - **StableAdamW**: An AdamW-Adafactor hybrid optimizer, called StableAdamW, is proposed. It tracks the ratio of the squared gradient to the second-moment estimator and reduces the learning rate when that ratio is large, which avoids loss spikes and improves training stability.

### Experimental results

- **Accelerating training**:
  - **Accuracy**: SwitchBack matches the performance of bfloat16 training at both int8 and fp8 precision. On CLIP ViT-Huge, the accuracy difference is within 0.1 percentage points.
  - **Speed**: In end-to-end CLIP training on a single node with 4 A100 GPUs, the speed-up of SwitchBack over the standard PyTorch layer ranges from 5% to 35%.
- **Stabilizing training**:
  - **Loss curve**: Models trained with the StableAdamW optimizer show no loss spikes during training, while models trained with the other methods show clear loss spikes.
  - **Feature magnitude**: Initializing layer-scale to zero keeps feature magnitudes under control and avoids divergence in low-precision training.

### Summary

This paper addresses the acceleration and stability problems in the training of large-scale vision-language models by introducing the SwitchBack linear layer and the StableAdamW optimizer, respectively, offering practical techniques for future large-scale model training.
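
The learning-rate adjustment described for StableAdamW can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes standard AdamW bias correction, computes the ratio per tensor as the root mean square of `grad**2 / u_hat`, and divides the learning rate by `max(1, rms)`. The names `init_state` and `stable_adamw_step` and the default hyperparameters are hypothetical.

```python
import numpy as np


def init_state(param):
    """Optimizer state for one tensor: step count and both moment estimates."""
    return {"t": 0, "m": np.zeros_like(param), "u": np.zeros_like(param)}


def stable_adamw_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.99,
                      eps=1e-8, weight_decay=0.01):
    """One AdamW step with an Adafactor-style update clip (sketch).

    Mutates `state` in place and returns (new_param, rms, step_lr).
    """
    state["t"] += 1
    t = state["t"]

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["u"] = beta2 * state["u"] + (1.0 - beta2) * grad ** 2

    # Standard bias correction (an assumption of this sketch).
    m_hat = state["m"] / (1.0 - beta1 ** t)
    u_hat = state["u"] / (1.0 - beta2 ** t)

    # Ratio of the squared gradient to its second-moment estimate.
    # rms > 1 means the estimator currently under-estimates the gradient.
    rms = float(np.sqrt(np.mean(grad ** 2 / np.maximum(u_hat, eps ** 2))))

    # Shrink the learning rate only when the ratio is large.
    step_lr = lr / max(1.0, rms)

    # Decoupled weight decay plus the Adam update, both using step_lr.
    new_param = (param
                 - step_lr * weight_decay * param
                 - step_lr * m_hat / (np.sqrt(u_hat) + eps))
    return new_param, rms, step_lr
```

With a steady gradient the ratio stays near 1 and the step matches plain AdamW. If the gradient suddenly grows by a large factor, `rms` rises above 1 and the effective learning rate for that tensor drops for that step.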

Stable and low-precision training for large-scale vision-language models

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Scaling FP8 training to trillion-token LLMs

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs

Phantom of Latent for Large Language and Vision Models

Q-VLM: Post-training Quantization for Large Vision-Language Models

Rethinking Overlooked Aspects in Vision-Language Models

FP8-LM: Training FP8 Large Language Models

To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability

Spike No More: Stabilizing the Pre-training of Large Language Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Methods of improving LLM training stability

POINTS: Improving Your Vision-language Model with Affordable Strategies

Efficient Low-rank Backpropagation for Vision Transformer Adaptation

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Accelerating Vision-Language Pretraining with Free Language Modeling

Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks