Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges in system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, and each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without caring about low-level operator implementations; 2) we enable task scheduling with finer-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected by networks ranging from 8 Mbps to 10 Gbps. Experimental results demonstrate that our system and method achieve a 1.45x to 9.39x speedup over baseline methods while ensuring convergence.
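As a rough illustration of the OP-DAG idea in the abstract, the sketch below stores a toy model as a mapping from each operator to the operators it depends on, derives a forward order that respects every data dependency, and reverses it for the backward pass. The operator names, the `forward_order`, `backward_order`, and `partition` helpers, and the equal-count partitioning rule are all assumptions made for this example; they are not FusionLLM's actual executor or the OP-Fence scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy OP-DAG: each key is an operator, each value lists the
# operators whose outputs it consumes (its data dependencies).
op_dag = {
    "embed": [],
    "attn": ["embed"],
    "mlp": ["attn"],
    "residual": ["embed", "mlp"],
    "loss": ["residual"],
}


def forward_order(dag):
    """Return an execution order in which every operator follows its inputs."""
    return list(TopologicalSorter(dag).static_order())


def backward_order(dag):
    """Gradients flow in the reverse of the forward order."""
    return list(reversed(forward_order(dag)))


def partition(order, num_stages):
    """Naive stand-in for a scheduler: contiguous stages of near-equal size."""
    size = -(-len(order) // num_stages)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


stages = partition(forward_order(op_dag), num_stages=2)
```

Each stage in `stages` could then be assigned to a different device, with only the tensors that cross a stage boundary sent over the network. A real scheduler would weigh per-operator workload and link bandwidth rather than operator count.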

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses the scarcity of hardware resources for training large deep neural networks, especially large language models (LLMs). It proposes FusionLLM, a decentralized training system that trains efficiently on geographically distributed GPUs and targets the following challenges:

1. **Remote Automatic Differentiation (RAD)**: Current machine learning frameworks do not support automatic differentiation over the internet, which prevents seamless remote processing of the computation graph and remote gradient computation.
2. **Flexible Model Definition and Heterogeneous Software Environments**: Participants often run different software environments, including different versions of CUDA and of machine learning frameworks, which makes synchronizing these environments very difficult.
3. **Heterogeneous Hardware Performance**: The hardware offered by different computing nodes varies greatly in GPU and CPU architecture, memory capacity, network bandwidth, and computing power. This heterogeneity leads to inconsistent task completion times, especially during forward and backward propagation, causing the "straggler" problem.
4. **Low Network Bandwidth**: Geographically distributed devices typically communicate over the internet, whose low bandwidth (10 Mbps to 10 Gbps) results in unacceptable communication times, especially in LLM training, which requires extensive data exchange.

### Solution

To address these challenges, FusionLLM is designed with the following key components and techniques:

1. **Operator Directed Acyclic Graph (OP-DAG)**: The model is abstracted as a directed acyclic graph of operators, with each node representing an operator or layer in the DNN and each edge representing a data dependency between operators. Based on this design, users can customize any DNN without worrying about the underlying operator implementation; the system can perform finer-grained task scheduling, providing more optimization space; and the DAG runtime executor can achieve RAD without requiring consistent low-level ML framework versions.
2. **Workload Estimator**: Estimates the computational workload of each operator. Building on these estimates, an OP-Fence scheduler clusters devices with similar bandwidth together and partitions the DAG to improve throughput.
3. **AdaTopK Compressor**: An adaptive compression mechanism that compresses intermediate activations and gradients on the slowest communication links, improving system performance while maintaining training convergence.

### Experimental Results

Experimental results show that FusionLLM and its methods achieve a 1.45x to 9.39x speedup over baseline methods when training ResNet-101 and GPT-2 with 48 heterogeneous GPUs (network bandwidth ranging from 8 Mbps to 10 Gbps) on three real-world testbeds, while ensuring training convergence.

### Contributions

1. **Identified and analyzed key challenges and optimization opportunities in decentralized training**.
2. **Designed a general decentralized training system, FusionLLM**, supporting remote automatic differentiation, flexible model definition, and heterogeneous software environments.
3. **Implemented a workload estimator for each layer**, analyzed overall throughput, and designed a scheduler to reduce communication overhead and improve system throughput.
4. **Proposed an adaptive compression mechanism, AdaTopK**, for compressing intermediate activations and gradients.
5. **Implemented the FusionLLM system** and conducted experiments on three clusters to demonstrate its effectiveness and superiority.
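The Top-K sparsification idea behind the AdaTopK compressor described above can be sketched as follows: keep only the largest-magnitude entries of an activation or gradient tensor, send their indices and values, and rebuild a dense tensor with zeros elsewhere on the receiving side. The function names and, in particular, the `adaptive_ratios` rule (the slowest link gets the maximum compression ratio, faster links proportionally less) are assumptions for illustration; the summary does not specify how AdaTopK chooses its ratios.

```python
def topk_compress(values, ratio):
    """Keep the len(values) // ratio largest-magnitude entries (at least one)."""
    k = max(1, len(values) // ratio)
    by_magnitude = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    indices = sorted(by_magnitude[:k])
    return indices, [values[i] for i in indices]


def topk_decompress(indices, kept, length):
    """Rebuild a dense list, filling dropped positions with zeros."""
    dense = [0.0] * length
    for i, v in zip(indices, kept):
        dense[i] = v
    return dense


def adaptive_ratios(link_bandwidths_mbps, max_ratio):
    """Hypothetical rule: compress hardest on the slowest link, not at all on fast ones."""
    slowest = min(link_bandwidths_mbps)
    return [max(1, round(max_ratio * slowest / bw)) for bw in link_bandwidths_mbps]


# Example: one tensor sent over the slowest of three links (8 Mbps, 100 Mbps, 10 Gbps).
activations = [0.1, -2.0, 0.3, 1.5, -0.05, 0.7, 0.0, -0.9]
ratios = adaptive_ratios([8, 100, 10_000], max_ratio=4)
indices, kept = topk_compress(activations, ratios[0])
restored = topk_decompress(indices, kept, len(activations))
```

With a ratio of 4, only two of the eight values cross the slow link, while the 10 Gbps link gets a ratio of 1 and sends the tensor uncompressed. This trades some approximation error on slow links for less communication time.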

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
