Abstract: The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce *Flash Communication*, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the *time-to-first-token* by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.

What problem does this paper attempt to address?

This paper addresses the communication bottleneck that tensor parallelism introduces during large language model (LLM) inference. As LLM parameter counts keep growing, distributed inference becomes increasingly important. However, distributing the computation adds significant communication overhead between devices, especially on devices with limited bandwidth. The paper reports that communication can account for more than 65% of total inference latency (Figure 1). For models with many parameters, such as LLaMA-3-70B, each forward pass performs a large number of All-Reduce operations, which adds further to the communication burden. The key problem is therefore how to reduce this communication overhead while preserving model accuracy.

### Main contributions of the paper

To address this problem, the paper makes the following main contributions:

1. **Reveal the communication bottleneck**: Through detailed measurements, the paper shows that a communication bottleneck exists during LLM inference. For example, on an NVIDIA L40 GPU, communication overhead may account for 65% of the total latency.
2. **Design an efficient communication mechanism**: The paper proposes a mechanism named "Flash Communication". It reduces the communication volume through low-bit, fine-grained quantization of activation values and adopts a two-step All-Reduce strategy to minimize the number of communication hops.
3. **Implement a fused CUDA kernel**: The paper implements a fused CUDA kernel named "Flash All-Reduce" that performs Flash Communication.

The experimental results show that on an NVIDIA L40 GPU the method reduces the time-to-first-token (TTFT) by a factor of 2. A significant latency reduction is also observed on the higher-bandwidth A100 GPU, which supports the effectiveness of the method.

### Specific manifestations of the communication bottleneck

By analyzing the operation cost of the LLaMA-3-70B model (Figure 2), the paper finds that communication overhead grows rapidly as the input sequence length increases. On the L40 GPU in particular, communication takes up most of the inference time. Even on high-end training accelerators such as the NVIDIA A100, where the GPUs are connected by NVLink, communication overhead still reaches a significant 20%. This indicates that tensor-parallel communication is a major bottleneck in the inference process.

### Core ideas of the solution

To meet this challenge, the paper proposes a method that combines quantization with topological optimization. Specifically:

- **Low-bit quantization**: Reduce the volume of communicated data through low-bit, fine-grained quantization of activation values.
- **Two-step All-Reduce**: Adopt a two-step All-Reduce strategy to reduce the number of communication hops and thus the latency.

Together, these techniques reduce communication overhead and improve inference speed while affecting model accuracy as little as possible.
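
The two ideas above can be illustrated with a small pure-Python simulation. This is a minimal sketch under stated assumptions, not the paper's fused CUDA kernel: it assumes the two steps correspond to a reduce-scatter-style exchange followed by an all-gather-style exchange, and that activations are quantized per group before every transfer and dequantized on arrival. The names `quantize`, `dequantize`, `send`, and `two_step_all_reduce`, the group size of 32, and the 8-bit width are all hypothetical choices made for illustration; the summary above does not specify them.

```python
import random

GROUP_SIZE = 32  # assumed group size for fine-grained quantization
BITS = 8         # assumed bit width, chosen only for illustration


def quantize(values, group_size=GROUP_SIZE, bits=BITS):
    """Asymmetric per-group quantization into (codes, scale, zero_point) groups."""
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups):
    """Invert quantize(): map integer codes back to approximate floats."""
    out = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


def send(values):
    """Simulate one transfer: quantize on the sender, dequantize on the receiver."""
    return dequantize(quantize(values))


def two_step_all_reduce(per_device):
    """per_device[d] is device d's activation vector; returns each device's sum."""
    n = len(per_device)
    shard = len(per_device[0]) // n  # assumes the length is divisible by n

    # Step 1 (reduce-scatter style): device d receives shard d from every
    # peer in quantized form and sums it with its own local shard.
    reduced = []
    for d in range(n):
        lo, hi = d * shard, (d + 1) * shard
        total = [0.0] * shard
        for src in range(n):
            piece = per_device[src][lo:hi]
            if src != d:
                piece = send(piece)
            total = [a + b for a, b in zip(total, piece)]
        reduced.append(total)

    # Step 2 (all-gather style): every device collects all reduced shards,
    # again receiving remote shards in quantized form.
    results = []
    for d in range(n):
        full = []
        for src in range(n):
            full.extend(reduced[src] if src == d else send(reduced[src]))
        results.append(full)
    return results


if __name__ == "__main__":
    random.seed(0)
    devices = [[random.uniform(-1, 1) for _ in range(256)] for _ in range(4)]
    exact = [sum(col) for col in zip(*devices)]
    approx = two_step_all_reduce(devices)
    print(max(abs(a - b) for row in approx for a, b in zip(row, exact)))
```

In this sketch each value crosses the interconnect at most twice, once per step, and always in low-bit form, which is where the reduction in communication volume comes from. The quantization error introduced in each step is bounded by half the per-group scale, so finer groups or more bits trade bandwidth for accuracy.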

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

Communication Compression for Tensor Parallel LLM Inference

Latency-minimizing Semantic Communication with Dynamic Model Partitioning

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

FlashDecoding++: Faster Large Language Model Inference on GPUs

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Towards Low-bit Communication for Tensor Parallel LLM Inference

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Inference Performance Optimization for Large Language Models on CPUs

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

On Optimizing the Communication of Model Parallelism

Accelerating Large Language Model Training with Hybrid GPU-based Compression

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference