Abstract: Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime, which limits the scalability of this technique to a group of devices with high-speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.
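To make the source of the extra communication concrete, here is a minimal NumPy sketch of a tensor-parallel MLP. It is an illustration only, not code from the paper: the function names, the ReLU activation, the matrix sizes, and the column-then-row sharding are all assumptions chosen for the example, and the "devices" are simulated as list entries on one machine.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def mlp_reference(x, w1, w2):
    """Single-device MLP: two GEMMs with an elementwise activation between."""
    return relu(x @ w1) @ w2


def mlp_tensor_parallel(x, w1, w2, num_devices):
    """Simulated tensor-parallel MLP (hypothetical helper for illustration).

    w1 is split by columns and w2 by rows, so each simulated device holds
    only 1/num_devices of each weight matrix.
    """
    w1_shards = np.split(w1, num_devices, axis=1)
    w2_shards = np.split(w2, num_devices, axis=0)
    # Each device produces a full-sized but partial output.
    partials = [relu(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
    # Summing the partials stands in for the all-reduce (or reduce-scatter)
    # that real devices would have to perform over the interconnect.
    return np.sum(partials, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))

tp_out = mlp_tensor_parallel(x, w1, w2, num_devices=4)
ref_out = mlp_reference(x, w1, w2)
max_abs_diff = float(np.max(np.abs(tp_out - ref_out)))
```

The final summation is the step with no counterpart in the single-device version. It is the communication that the abstract describes as a potentially significant share of runtime.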

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion" addresses the communication overhead that tensor parallelism introduces in distributed training and inference of large deep learning models. Tensor parallelism partitions the computation of an operation or layer across multiple devices, either to overcome the memory capacity limit of a single processor or to accelerate computation to meet a latency requirement. However, the added communication can account for a significant portion of total runtime, which confines the technique's scalability to groups of devices with high-speed interconnects, such as GPUs connected via NVLink within a node.

### Solution

To address this issue, the paper proposes FLUX, which decomposes communication and computation operations into much finer-grained operations and fuses them into a larger kernel. This hides communication latency without sacrificing kernel efficiency. FLUX can potentially overlap up to 96% of communication time, achieving up to 1.24x training speedup over Megatron-LM on a 128-GPU cluster, and up to 1.66x prefill and 1.30x decoding inference speedups over vLLM on an 8-GPU cluster.

### Main Contributions

1. **Identifying Performance Issues of Existing Communication Overlap Techniques on GPUs**: The paper analyzes the performance bottlenecks of existing communication overlap techniques when applied to GPUs.
2. **Proposing a New Communication Overlap Technique**: FLUX better adapts to modern GPU designs by finely decomposing and fusing communication and computation operations.
3. **Implementation and Optimization**: The technique is implemented using NVIDIA CUTLASS and optimized for different generations of GPUs (such as A100 and H800) and different intra-node interconnects (such as PCIe and NVLink).
4. **Evaluation and Results**: Training evaluations on multiple 128-GPU clusters and inference evaluations on multiple 8-GPU clusters show that FLUX achieves significant speedups in training, prefill, and decoding inference.

### Background and Traditional Methods

- **Tensor Parallelism and Communication Patterns**: The paper discusses common tensor parallelism strategies and their communication patterns, such as those in the forward and backward passes of the multi-layer perceptron (MLP) section.
- **Traditional Communication Overlap Strategies**: Existing communication overlap methods decompose computation and communication operations into chunks and carefully schedule these operations. However, on GPUs these methods are limited in how well they can control execution order, concurrent execution, and execution timing.

### Key Techniques of FLUX

- **Fine-Grained Decomposition and Fusion**: FLUX decomposes communication and computation operations into finer-grained blocks and fuses these blocks into a larger kernel, with each dependent computation and communication block mapped to a thread block.
- **Optimizing Communication and Computation**: Optimization measures include kernel fusion, coordinate reordering, GPU instruction selection, and communication sequence selection, enabling FLUX to better adapt to different GPU architectures and interconnects.

### Experimental Results

- **Performance Improvement**: FLUX demonstrates significant performance improvements across various tasks and configurations, especially in communication-intensive tasks.
- **Effective Communication Time and Overlap Efficiency**: By defining effective communication time and overlap efficiency, the paper demonstrates the effectiveness of FLUX in reducing communication overhead.

In summary, FLUX addresses the communication bottleneck in distributed training and inference of large deep learning models through fine-grained fusion of communication and computation, significantly improving overall system performance.
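The fine-grained decomposition described above can be sketched in NumPy as a serial simulation. This is a hedged illustration under stated assumptions, not the FLUX implementation: the real system is a fused CUDA kernel built on CUTLASS, whereas here the function names, the row-tile size `tile_m`, and the choice of a GEMM followed by reduce-scatter are all assumptions, and nothing actually runs concurrently. The sketch only shows that sending each output tile to its owner as soon as it is computed gives the same result as finishing the whole GEMM before communicating.

```python
import numpy as np


def gemm_then_reduce_scatter(x_shards, w_shards):
    """Coarse-grained baseline: finish every partial GEMM, then communicate."""
    num_devices = len(x_shards)
    partials = [x @ w for x, w in zip(x_shards, w_shards)]
    total = np.sum(partials, axis=0)  # stands in for the reduction
    return np.split(total, num_devices, axis=0)  # each device keeps a row block


def tilewise_gemm_reduce_scatter(x_shards, w_shards, tile_m):
    """Fine-grained variant (hypothetical helper): one tile at a time.

    Each tile is accumulated into its owner's buffer right after it is
    computed, mimicking a remote write issued from a GEMM epilogue.
    """
    num_devices = len(x_shards)
    m = x_shards[0].shape[0]
    n_out = w_shards[0].shape[1]
    rows_per_device = m // num_devices
    # Assumption of this sketch: tiles never straddle two owners.
    assert m % num_devices == 0 and rows_per_device % tile_m == 0
    out = [np.zeros((rows_per_device, n_out)) for _ in range(num_devices)]
    for x, w in zip(x_shards, w_shards):      # loop over simulated devices
        for row0 in range(0, m, tile_m):      # loop over output row tiles
            tile = x[row0:row0 + tile_m] @ w  # compute one tile
            owner = row0 // rows_per_device   # device that owns these rows
            local0 = row0 - owner * rows_per_device
            out[owner][local0:local0 + tile_m] += tile  # "send" and reduce
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 12))
w = rng.standard_normal((12, 6))
x_shards = np.split(x, 4, axis=1)  # each device holds a slice of the K dim
w_shards = np.split(w, 4, axis=0)

reference = np.split(x @ w, 4, axis=0)
coarse = gemm_then_reduce_scatter(x_shards, w_shards)
fused = tilewise_gemm_reduce_scatter(x_shards, w_shards, tile_m=2)
```

In the baseline, no data can move until the full partial output exists. In the tile-wise version, each tile's transfer depends only on that tile, which is what lets a fused kernel overlap the communication of finished tiles with the computation of later ones.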

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

FlashDecoding++: Faster Large Language Model Inference on GPUs

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

ParallelFusion

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

Sub-model Parallelism: A Scale-out Deployment Method for Large Multi-modal DNNs

Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning

NanoFlow: Towards Optimal Large Language Model Serving Throughput

An Efficient 2D Method for Training Super-Large Deep Learning Models

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

On Optimizing the Communication of Model Parallelism

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

FLNA: an Energy-Efficient Point Cloud Feature Learning Accelerator with Dataflow Decoupling

Accelerating Deep Learning Inference via Model Parallelism and Partial Computation Offloading

FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters