SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao
2024-10-21
Abstract: Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08$\times$ speedup in end-to-end throughput on a scale of 128 GPUs.
Machine Learning; Distributed, Parallel, and Cluster Computing
### What problem does this paper attempt to solve?

This paper targets the excessive communication overhead of Sharded Data Parallelism (ShardedDP) in the training of large language models (LLMs). Specifically:

1. **Communication overhead problem**: As the number of model parameters grows, training time and memory usage increase significantly. Distributed training, especially ShardedDP, which partitions optimizer states among workers, has become a key technique for reducing both. However, the main scalability challenge of ShardedDP is the intensive communication of weights and gradients.
2. **Limitations of compression techniques**: Compression can reduce the communication burden, but it usually degrades model accuracy. In existing research, QSDP and ZeRO++ attempt to quantize ShardedDP communication to 4-bit integers (Int4), but they cannot keep the training loss comparable to the baseline under such extreme compression ratios and lack theoretical convergence guarantees.
3. **Objective**: To overcome these limitations, the authors propose SDP4Bit, a communication-reduction strategy that compresses the communication of weights and gradients to nearly 4 bits using two techniques:
   - **Quantization on Weight Differences**: Quantize the weight differences between the current and the previous iteration to 4 bits, rather than the weights themselves.
   - **Two-Level Gradient Smooth Quantization**: Quantize intra-node gradients to 8 bits and cross-node gradients to 4 bits, and apply a Hadamard transform to smooth outliers.

In addition, SDP4Bit minimizes the computational overhead of compression through algorithm-system co-design and runtime optimization.
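The two quantization techniques named above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the group-wise symmetric quantizer, the group size of 32, the synthetic weights and gradients, and all function names (`quantize`, `roundtrip`, `smoothed_roundtrip`, and so on) are assumptions made for illustration. It only shows why quantizing small weight differences loses less than quantizing the weights directly, and why rotating a group with a Hadamard transform before 4-bit quantization can reduce the error caused by an outlier.

```python
import random

def quantize(values, bits=4, group_size=32):
    """Group-wise symmetric quantization: one scale per group of values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits, 127 for 8 bits
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0   # avoid a zero scale
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

def dequantize(codes, scales, group_size=32):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def roundtrip(values, bits=4):
    """Quantize then dequantize, i.e. what a receiver would reconstruct."""
    return dequantize(*quantize(values, bits))

def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two.
    With this normalization the transform is its own inverse."""
    out = list(values)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                out[j], out[j + h] = out[j] + out[j + h], out[j] - out[j + h]
        h *= 2
    norm = len(out) ** 0.5
    return [v / norm for v in out]

def smoothed_roundtrip(values, bits=4, group_size=32):
    """Rotate each group with a Hadamard transform, quantize, then rotate back."""
    def rotate(vec):
        out = []
        for start in range(0, len(vec), group_size):
            out.extend(hadamard(vec[start:start + group_size]))
        return out
    return rotate(roundtrip(rotate(values), bits))

def mean_abs_error(approx, exact):
    return sum(abs(a - e) for a, e in zip(approx, exact)) / len(exact)

rng = random.Random(0)
n = 1024  # synthetic tensor size (assumption)

# 1) Weights: quantize the weights directly vs. quantize the per-step difference.
w_prev = [rng.gauss(0.0, 1.0) for _ in range(n)]
w_new = [w + rng.gauss(0.0, 1e-3) for w in w_prev]   # small optimizer update
diff = [a - b for a, b in zip(w_new, w_prev)]
err_direct = mean_abs_error(roundtrip(w_new), w_new)
# The receiver is assumed to hold w_prev and to add the dequantized difference.
w_recv = [w + d for w, d in zip(w_prev, roundtrip(diff))]
err_diff = mean_abs_error(w_recv, w_new)

# 2) Gradients: 8 bits (intra-node), plain 4 bits, and Hadamard-smoothed 4 bits.
grad = [rng.gauss(0.0, 0.01) for _ in range(n)]
for start in range(0, n, 32):
    grad[start] = 0.1                                # one outlier per group
err_int8 = mean_abs_error(roundtrip(grad, bits=8), grad)
err_plain = mean_abs_error(roundtrip(grad, bits=4), grad)
err_smooth = mean_abs_error(smoothed_roundtrip(grad, bits=4), grad)

print(f"weights, direct 4-bit:     {err_direct:.2e}")
print(f"weights, 4-bit difference: {err_diff:.2e}")
print(f"gradients, 8-bit:          {err_int8:.2e}")
print(f"gradients, plain 4-bit:    {err_plain:.2e}")
print(f"gradients, smoothed 4-bit: {err_smooth:.2e}")
```

In this toy setting the difference has a much smaller dynamic range than the weights, so the same 4-bit grid resolves it far more finely. The Hadamard rotation spreads each outlier across its group, which shrinks the per-group scale used for the remaining values.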
Experimental results show that SDP4Bit has a negligible impact on training loss when pre-training GPT models with up to 6.7 billion parameters, and achieves an end-to-end throughput speedup of up to 4.08× on 128 GPUs.

### Main contributions

- Propose a low-bit (nearly 4-bit) communication-reduction strategy for ShardedDP that maintains end-to-end training accuracy.
- Establish convergence guarantees, showing the same convergence rate as ordinary stochastic gradient descent (SGD) while allowing a broader choice of biased compressors under weaker assumptions.
- Implement the method in the Megatron-LM framework and further improve performance through runtime optimizations such as buffer reuse, operation pruning, and kernel fusion.
- Experimentally verify that SDP4Bit compresses the communication of weights and gradients to nearly 4 bits with little effect on the final loss and a significant improvement in training speed.

Through these improvements, SDP4Bit provides an effective solution for efficient distributed training of large language models.
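One plausible reading of "nearly 4 bits" is that group-wise quantization also has to transmit a scale per group, which adds a small per-value overhead on top of the 4-bit codes. The arithmetic can be sketched as follows. The group size of 128, the 16-bit scale, and the 16-bit uncompressed baseline are hypothetical values chosen for illustration, not figures taken from the paper, and the helper names are likewise made up.

```python
def effective_bits(bits, group_size, scale_bits):
    """Average bits per value once each group's scale is counted."""
    return bits + scale_bits / group_size

def payload_gib(num_values, bits_per_value):
    """Size in GiB of one full pass over num_values at the given bit width."""
    return num_values * bits_per_value / 8 / 2**30

params = 6.7e9  # largest GPT model size mentioned in the summary

# Hypothetical layout: 4-bit codes, one 16-bit scale per 128 values.
int4 = effective_bits(bits=4, group_size=128, scale_bits=16)

print(f"effective bits per value: {int4}")
print(f"16-bit payload (assumed baseline): {payload_gib(params, 16):.2f} GiB")
print(f"4-bit payload with scales:         {payload_gib(params, int4):.2f} GiB")
```

Smaller groups track local value ranges more closely but raise the effective bit width, so the group size trades accuracy against communication volume.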