Abstract: Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, along with growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP), which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08$\times$ speedup in end-to-end throughput on a scale of 128 GPUs.


### What problem does this paper attempt to solve?

This paper addresses the excessive communication overhead of Sharded Data Parallelism (ShardedDP) when training large-scale language models (LLMs). Specifically:

1. **Communication overhead problem**: As the number of model parameters grows, training time and memory usage grow significantly as well. Distributed training, especially ShardedDP, which partitions the optimizer states among workers, has become a key technique for reducing both. However, the main scalability challenge of ShardedDP is the intensive communication of weights and gradients.
2. **Limitations of compression techniques**: Compression can reduce the communication burden, but it usually degrades model accuracy. Among existing work, QSDP and ZeRO++ attempt to quantize ShardedDP communication to 4-bit integers (Int4), but they cannot maintain training loss comparable to the baseline under extreme compression ratios, and they lack theoretical convergence guarantees.
3. **Objective**: To overcome these limitations, the authors propose SDP4Bit, a communication-reduction strategy that compresses the communication of weights and gradients to nearly 4 bits through two techniques:
   - **Quantization on Weight Differences**: Quantize the difference between the weights of the current and the previous iteration to 4 bits, rather than the weights themselves.
   - **Two-Level Gradient Smooth Quantization**: Quantize intra-node gradients to 8 bits and cross-node gradients to 4 bits, and use the Hadamard transform to smooth outliers.

In addition, SDP4Bit minimizes the computational overhead of compression through algorithm-system co-design and runtime optimization.
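The intuition behind quantizing weight differences can be illustrated with a small sketch. It is not the authors' implementation: the per-group symmetric quantizer, the group size of 32, the simulated update size of 1e-3, and all function names are assumptions chosen for illustration, and the receiver is assumed to hold the same previous weights as the sender. Because one optimizer step changes each weight only slightly, the differences span a much narrower range than the weights, so the same 4-bit grid resolves them far more finely.

```python
import random


def quantize_int4(values, group_size=32):
    """Symmetric per-group quantization to integer codes in [-7, 7]."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=32):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
n = 1024
prev_weights = [random.gauss(0.0, 1.0) for _ in range(n)]
# Assumption: one optimizer step perturbs each weight by roughly 1e-3.
new_weights = [w + random.gauss(0.0, 1e-3) for w in prev_weights]

# Option A: quantize the new weights directly.
direct = dequantize_int4(*quantize_int4(new_weights))

# Option B: quantize only the difference; the receiver adds the
# dequantized difference to its copy of the previous weights.
diff = [new - old for new, old in zip(new_weights, prev_weights)]
recon_diff = dequantize_int4(*quantize_int4(diff))
via_diff = [old + d for old, d in zip(prev_weights, recon_diff)]

err_direct = mean_abs_error(direct, new_weights)
err_diff = mean_abs_error(via_diff, new_weights)
print("mean abs error, direct 4-bit weights:  ", err_direct)
print("mean abs error, 4-bit weight difference:", err_diff)
```

In this toy setting the reconstruction error of the difference-based path is much smaller than that of direct weight quantization, which matches the motivation stated above. The sketch ignores bit packing and communication entirely.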
Experimental results show that SDP4Bit has a negligible effect on training loss when pre-training GPT models with up to 6.7 billion parameters, and achieves an end-to-end throughput speedup of up to 4.08× on 128 GPUs.

### Main contributions

- Propose a low-bit (close to 4-bit) communication-reduction strategy for ShardedDP that maintains end-to-end training accuracy.
- Establish convergence guarantees, showing the same convergence rate as ordinary stochastic gradient descent (SGD) while admitting a broader choice of biased compressors under weaker assumptions.
- Implement the method in the Megatron-LM framework and further improve performance through runtime optimizations such as buffer reuse, operation pruning, and kernel fusion.
- Verify experimentally that SDP4Bit compresses the communication of weights and gradients to nearly 4 bits with negligible effect on the final loss, while significantly improving training speed.

Through these improvements, SDP4Bit provides an effective solution for the efficient distributed training of large-scale language models.
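The outlier smoothing used in the two-level gradient quantization can also be sketched in a few lines. This is an illustration under stated assumptions, not the paper's kernel: it shows only a 4-bit stage on a single group of 256 values, the synthetic gradient (small Gaussian entries plus one outlier of magnitude 1.0) is made up, and the function names are hypothetical. An orthonormal Hadamard transform spreads the outlier's energy across all coordinates, so the quantization scale is no longer dictated by one large entry. Since the normalized transform is its own inverse, the receiver applies it again after dequantization.

```python
import math
import random


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize_roundtrip(vec, levels=7):
    """Symmetric 4-bit quantize-then-dequantize over one group."""
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Assumed synthetic gradient: small entries plus a single large outlier.
grad = [random.gauss(0.0, 0.02) for _ in range(256)]
grad[5] = 1.0

# Plain 4-bit quantization: the outlier sets the scale, so the small
# entries collapse to zero.
plain = quantize_roundtrip(grad)

# Smoothed: transform, quantize, then invert the transform.
smoothed = hadamard(quantize_roundtrip(hadamard(grad)))

err_plain = mse(plain, grad)
err_smooth = mse(smoothed, grad)
print("MSE without Hadamard smoothing:", err_plain)
print("MSE with Hadamard smoothing:   ", err_smooth)
```

Because the transform is orthonormal, the error introduced in the transformed domain carries over unchanged to the original domain, which is why a smaller quantization step after smoothing translates directly into a smaller reconstruction error.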

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
