Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

Qingyuan Li, Bo Zhang, Liang Ye, Yifan Zhang, Wei Wu, Yerui Sun, Lin Ma, Yuchen Xie
2024-12-06
Abstract: The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce *Flash Communication*, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the *time-to-first-token* by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the communication bottleneck caused by tensor parallelism during large language model (LLM) inference. As LLM parameter counts continue to grow, distributed inference becomes increasingly important. However, distributing the computation introduces significant communication overhead between devices, especially on devices with limited bandwidth. The paper reports that communication can account for more than 65% of total inference latency (Figure 1). Large models such as LLaMA-3-70B must perform many All-Reduce operations in each forward pass, which further increases the communication burden. The key problem is therefore how to reduce this overhead while maintaining model accuracy.

### Main contributions of the paper

To solve the above problem, the paper makes the following main contributions:

1. **Reveal the communication bottleneck**: Through detailed measurements, the paper shows that communication is a bottleneck in LLM inference. For example, on an NVIDIA L40 GPU, communication can account for 65% of total latency.
2. **Design an efficient communication mechanism**: The paper proposes "Flash Communication", which reduces communication volume through low-bit, fine-grained quantization of activations and adopts a two-step All-Reduce strategy to minimize the number of communication hops.
3. **Implement a fused CUDA kernel**: The paper implements a fused CUDA kernel, "Flash All-Reduce", that performs Flash Communication.
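The low-bit, fine-grained quantization in contribution 2 can be illustrated with a minimal pure-Python sketch. It is not the paper's fused CUDA kernel. The group size, the 4-bit width, the asymmetric (min/max) scheme, and the 32 bits of per-group metadata are all assumptions chosen for illustration, and the function names `quantize_groups`, `dequantize_groups`, and `payload_bits` are hypothetical.

```python
import random


def quantize_groups(values, group_size=32, bits=4):
    """Quantize a flat activation list group by group (assumed asymmetric scheme).

    Each group keeps its own scale and minimum, so one outlier only
    degrades the precision of its own group.
    """
    levels = (1 << bits) - 1  # highest integer code, e.g. 15 for 4 bits
    packed = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        codes = [min(levels, max(0, round((v - lo) / scale))) for v in group]
        packed.append((scale, lo, codes))
    return packed


def dequantize_groups(packed):
    """Rebuild approximate activations from (scale, minimum, codes) triples."""
    out = []
    for scale, lo, codes in packed:
        out.extend(lo + c * scale for c in codes)
    return out


def payload_bits(n, group_size, bits, meta_bits=32):
    """Bits on the wire: integer codes plus assumed per-group metadata."""
    groups = -(-n // group_size)  # ceiling division
    return n * bits + groups * meta_bits


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(256)]
packed = quantize_groups(activations, group_size=32, bits=4)
restored = dequantize_groups(packed)
worst = max(abs(a - b) for a, b in zip(activations, restored))
print("max abs reconstruction error:", worst)
print("bits vs FP16:", payload_bits(256, 32, 4), "vs", 256 * 16)
```

Smaller groups lower the reconstruction error but add more scale and minimum metadata per element, so the achievable compression ratio depends on both the bit width and the group size.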
Experimental results show that on an NVIDIA L40 GPU, Flash Communication reduces the time-to-first-token (TTFT) by a factor of 2. A significant latency reduction is also observed on the higher-bandwidth A100 GPU, which supports the method's effectiveness.

### Specific manifestations of the communication bottleneck

By analyzing the operation cost of the LLaMA-3-70B model (Figure 2), the paper finds that communication overhead grows rapidly with input sequence length. On the L40 GPU in particular, communication takes up most of the inference time. Even on high-end training accelerators such as the NVIDIA A100, where GPUs are connected by NVLink, communication overhead still reaches a significant 20%. This indicates that tensor-parallel communication is a major bottleneck in inference, and the dominant one on bandwidth-limited devices.

### Core ideas of the solution

To meet this challenge, the paper proposes a method that combines quantization and topological optimization. Specifically:

- **Low-bit quantization**: Reduce the amount of communicated data through low-bit, fine-grained quantization of activations.
- **Two-step All-Reduce**: Adopt a two-step All-Reduce strategy to reduce the number of communication hops and thus the latency.

Together, these techniques reduce communication overhead and improve inference speed while affecting model accuracy as little as possible.
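The way the two ideas combine can be sketched with a small single-process simulation. It assumes the two steps are a reduce-scatter followed by an all-gather, with every transfer compressed; the summary above does not spell this out, so treat it as an illustration rather than the paper's exact algorithm. The names `fake_quant` and `two_step_all_reduce` are hypothetical, and ranks are modeled as plain Python lists instead of GPUs.

```python
import random


def fake_quant(chunk, bits=8):
    """Quantize then dequantize one chunk, mimicking a lossy low-bit wire format."""
    levels = (1 << bits) - 1
    lo, hi = min(chunk), max(chunk)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant chunks
    return [lo + round((v - lo) / scale) * scale for v in chunk]


def two_step_all_reduce(rank_tensors, bits=8):
    """Simulate a sum All-Reduce as reduce-scatter plus all-gather.

    Simplification: a rank's own chunk is also passed through fake_quant,
    although a real implementation would not need to send it anywhere.
    """
    world = len(rank_tensors)
    n = len(rank_tensors[0])
    assert n % world == 0, "tensor length must divide evenly across ranks"
    chunk = n // world

    # Step 1 (reduce-scatter): rank r receives chunk r from every rank and sums it.
    reduced = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        received = [fake_quant(t[lo:hi], bits) for t in rank_tensors]
        reduced.append([sum(col) for col in zip(*received)])

    # Step 2 (all-gather): every rank shares its reduced chunk with all others.
    gathered = []
    for part in reduced:
        gathered.extend(fake_quant(part, bits))
    return [list(gathered) for _ in range(world)]


random.seed(1)
tensors = [[random.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(4)]
exact = [sum(col) for col in zip(*tensors)]
approx = two_step_all_reduce(tensors, bits=8)[0]
print("max abs error vs exact sum:", max(abs(a - b) for a, b in zip(approx, exact)))
```

In this scheme each value crosses the interconnect in two communication phases regardless of the number of ranks, whereas a ring All-Reduce over N ranks needs 2(N-1) sequential steps. The price is that quantization error is introduced twice, once per phase.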