FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, Xin He, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu
2024-10-17
Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without caring about low-level operator implementations; 2) we enable task scheduling with more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps to 10 Gbps networks. Experimental results demonstrate that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the scarcity of hardware resources for training large deep neural networks, especially large language models (LLMs). It proposes FusionLLM, a decentralized training system that trains efficiently on geographically distributed GPUs, and targets the following challenges:

1. **Remote Automatic Differentiation (RAD)**: Current machine learning frameworks do not support automatic differentiation over the internet, which prevents seamless processing of a remote computation graph and remote gradient computation.
2. **Flexible Model Definition and Heterogeneous Software Environments**: Participants often run different software environments, including different versions of CUDA and of machine learning frameworks, which makes keeping these environments in sync very difficult.
3. **Heterogeneous Hardware Performance**: The hardware offered by different computing nodes varies greatly in GPU and CPU architecture, memory capacity, network bandwidth, and computing power. This heterogeneity leads to inconsistent task completion times, especially during forward and backward propagation, causing the "straggler" problem.
4. **Low Network Bandwidth**: Geographically distributed devices typically communicate over the internet, where the low bandwidth (10 Mbps to 10 Gbps) results in unacceptable communication times, especially in LLM training, which requires extensive data exchange.

### Solution

To address the above challenges, the FusionLLM system is designed with the following key components and technologies:

1. **Operator Directed Acyclic Graph (OP-DAG)**: The model is abstracted as a directed acyclic graph of operators, with each node representing an operator or layer in the DNN, and each edge representing a data dependency between operators. Based on this design, users can customize any DNN without worrying about the underlying operator implementation; the system can schedule finer-grained sub-tasks, which leaves more room for optimization; and the DAG runtime executor can achieve RAD without requiring consistent low-level ML framework versions.
2. **Workload Estimator and OP-Fence Scheduler**: The workload estimator estimates the computational workload of each operator. The OP-Fence scheduler clusters devices with similar bandwidths together and partitions the DAG to improve throughput.
3. **AdaTopK Compressor**: An adaptive compression mechanism that compresses intermediate activations and gradients on the slowest communication links, improving system performance while maintaining training convergence.

### Experimental Results

Experimental results show that the FusionLLM system and methods achieve a 1.45x to 9.39x speedup over baseline methods when training ResNet-101 and GPT-2 on three real-world testbeds with 48 heterogeneous GPUs (network bandwidth ranging from 8 Mbps to 10 Gbps), while ensuring training convergence.

### Contributions

1. **Identified and analyzed key challenges and optimization opportunities in decentralized training**.
2. **Designed a general decentralized training system, FusionLLM**, supporting remote automatic differentiation, flexible model definition, and heterogeneous software environments.
3. **Implemented a per-layer workload estimator**, analyzed overall throughput, and designed a scheduler to reduce communication overhead and improve system throughput.
4. **Proposed an adaptive compression mechanism, AdaTopK**, for compressing intermediate activations and gradients.
5. **Implemented the FusionLLM system** and conducted experiments on three clusters to demonstrate its effectiveness and superiority.
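To make the OP-DAG idea more concrete, the sketch below represents a tiny model as a mapping from each operator to its predecessors, orders it topologically, and splits it into contiguous stages of roughly equal estimated cost. This is an illustration only, not the paper's OP-Fence algorithm: the operator names, the cost values, and the greedy `partition_into_stages` rule are all hypothetical, and the device clustering by bandwidth that OP-Fence performs is not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Tuple

# Hypothetical OP-DAG: each operator maps to the operators it depends on.
# "add" has a residual (skip) edge from "embed" in addition to "attn".
DEPS: Dict[str, List[str]] = {
    "embed": [],
    "attn": ["embed"],
    "add": ["embed", "attn"],
    "mlp": ["add"],
    "head": ["mlp"],
}

# Hypothetical per-operator workload estimates (arbitrary units).
COST: Dict[str, float] = {"embed": 1.0, "attn": 4.0, "add": 1.0, "mlp": 4.0, "head": 2.0}


def topo_order(deps: Dict[str, List[str]]) -> List[str]:
    """Return the operators in an order that respects data dependencies."""
    return list(TopologicalSorter(deps).static_order())


def partition_into_stages(order: List[str], cost: Dict[str, float],
                          num_stages: int) -> List[List[str]]:
    """Greedily cut the ordered operators into at most `num_stages` stages.

    A stage is closed once its accumulated cost reaches the average cost
    per stage; the last stage takes whatever remains.
    """
    target = sum(cost[op] for op in order) / num_stages
    stages: List[List[str]] = []
    current: List[str] = []
    acc = 0.0
    for op in order:
        current.append(op)
        acc += cost[op]
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    if current:
        stages.append(current)
    return stages


def cut_edges(deps: Dict[str, List[str]],
              stages: List[List[str]]) -> List[Tuple[str, str]]:
    """List the (producer, consumer) edges that cross a stage boundary.

    These are the tensors that would have to travel over the network.
    """
    stage_of = {op: i for i, stage in enumerate(stages) for op in stage}
    return [(src, dst) for dst, preds in deps.items() for src in preds
            if stage_of[src] != stage_of[dst]]


order = topo_order(DEPS)
stages = partition_into_stages(order, COST, num_stages=2)
crossing = cut_edges(DEPS, stages)
```

In this toy example only one edge crosses the stage boundary, so only that tensor (and its gradient on the backward pass) would be exchanged between the two devices. A real scheduler would also weigh the size of each crossing tensor against the bandwidth of the link it travels on.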
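The compression idea behind AdaTopK can be illustrated with a minimal top-k sparsifier whose keep ratio depends on the link bandwidth. The summary does not give the exact rule AdaTopK uses to choose ratios, so `adaptive_keep_ratio` below is an assumed stand-in (keep a fraction proportional to the link's bandwidth relative to the fastest link), and all function names are hypothetical.

```python
import heapq
from typing import List, Tuple


def topk_compress(values: List[float], keep_ratio: float) -> Tuple[List[int], List[float]]:
    """Keep only the largest-magnitude entries of a flattened tensor.

    Returns the kept indices (ascending) and their values; at least one
    entry is always kept.
    """
    k = max(1, int(len(values) * keep_ratio))
    idx = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    idx.sort()
    return idx, [values[i] for i in idx]


def topk_decompress(idx: List[int], vals: List[float], size: int) -> List[float]:
    """Rebuild a dense list, filling the dropped entries with zeros."""
    dense = [0.0] * size
    for i, v in zip(idx, vals):
        dense[i] = v
    return dense


def adaptive_keep_ratio(link_mbps: float, fastest_mbps: float,
                        min_ratio: float = 0.01) -> float:
    """Assumed rule, not taken from the paper: slower links keep less.

    The fastest link keeps everything (ratio 1.0); slower links keep a
    fraction proportional to their relative bandwidth, never below
    `min_ratio`.
    """
    return max(min_ratio, min(1.0, link_mbps / fastest_mbps))


# Example: an activation vector sent over a 2.5 Gbps link when the
# fastest link in the system is 10 Gbps.
activations = [0.1, -2.0, 0.05, 3.0, -0.2, 0.0, 1.5, -0.01]
ratio = adaptive_keep_ratio(link_mbps=2500, fastest_mbps=10000)
indices, kept = topk_compress(activations, ratio)
restored = topk_decompress(indices, kept, len(activations))
```

Only the indices and kept values need to cross the slow link; the receiver restores a sparse tensor of the original size. Because the same scheme is applied to gradients on the backward pass, the choice of ratio trades communication time against convergence, which is the trade-off the paper's evaluation targets.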