Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware, which leads to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, and each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without dealing with low-level operator implementations; 2) we enable task scheduling over more fine-grained sub-tasks, which offers a larger optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler that clusters devices with similar bandwidths together and partitions the DAG to increase throughput. Additionally, we propose an AdaTopK compressor that adaptively compresses intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected by networks ranging from 8 Mbps to 10 Gbps. Experimental results demonstrate that our system and method achieve a 1.45x to 9.39x speedup over baseline methods while ensuring convergence.
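
The idea of compressing activations and gradients only on the slowest links can be illustrated with a minimal sketch of generic Top-K sparsification. This is not the AdaTopK algorithm itself, whose details the abstract does not give. The function names, the link names, the bandwidth values, and the rule in `pick_ratio` (compress only the link whose bandwidth equals the minimum) are all assumptions made for illustration.

```python
import heapq


def topk_compress(values, ratio):
    """Keep the k largest-magnitude entries, k = max(1, int(len * ratio)).

    Returns (sorted indices, kept values, original length).
    """
    k = max(1, int(len(values) * ratio))
    idx = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    idx.sort()
    return idx, [values[i] for i in idx], len(values)


def topk_decompress(idx, kept, length):
    """Rebuild a dense vector, filling dropped entries with zeros."""
    dense = [0.0] * length
    for i, v in zip(idx, kept):
        dense[i] = v
    return dense


def pick_ratio(bandwidth_mbps, slowest_mbps, base_ratio):
    """Hypothetical rule: sparsify only on the slowest link, send dense elsewhere."""
    return base_ratio if bandwidth_mbps <= slowest_mbps else 1.0


# Hypothetical pipeline links with bandwidths in Mbps (illustrative values).
links = {"stage0->stage1": 10_000.0, "stage1->stage2": 8.0}
slowest = min(links.values())
activation = [0.02, -1.5, 0.3, 0.0, 2.1, -0.07, 0.9, -0.4]

for name, bandwidth in links.items():
    ratio = pick_ratio(bandwidth, slowest, base_ratio=0.25)
    idx, kept, n = topk_compress(activation, ratio)
    print(f"{name}: ratio={ratio}, sending {len(kept)} of {n} values")
```

In this sketch the receiver would call `topk_decompress` on the transmitted indices and values to recover a dense tensor before running the next operator. A real system would also have to account for the cost of sending the indices and for the effect of sparsification on convergence.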

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
