Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges in system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without caring about low-level operator implementations; 2) we enable task scheduling over more fine-grained sub-tasks, offering a larger optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected by networks ranging from 8 Mbps to 10 Gbps. Experimental results demonstrate that our system and method achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
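
The compression idea named in the abstract can be illustrated with a minimal sketch of generic Top-K sparsification, in which only the largest-magnitude entries of a tensor are sent over a slow link. This is not the paper's AdaTopK implementation: the function names, the flat-list tensor representation, and in particular the bandwidth-proportional rule in `keep_ratio_for_link` are assumptions made purely for illustration.

```python
import heapq


def topk_compress(values, keep_ratio):
    """Keep only the largest-magnitude entries of a flat list of floats.

    Returns (original_length, sorted_indices, kept_values).
    """
    n = len(values)
    k = max(1, int(n * keep_ratio)) if n else 0
    # Indices of the k entries with the largest absolute value.
    idx = heapq.nlargest(k, range(n), key=lambda i: abs(values[i]))
    idx.sort()
    return n, idx, [values[i] for i in idx]


def topk_decompress(n, idx, kept):
    """Rebuild a dense list, filling dropped positions with zeros."""
    dense = [0.0] * n
    for i, v in zip(idx, kept):
        dense[i] = v
    return dense


def keep_ratio_for_link(bandwidth_mbps, fastest_mbps, min_ratio=0.01):
    """Hypothetical adaptive rule (an assumption, not from the paper):
    slower links keep a smaller fraction, clamped to [min_ratio, 1.0]."""
    return max(min_ratio, min(1.0, bandwidth_mbps / fastest_mbps))


if __name__ == "__main__":
    activations = [0.1, -2.0, 0.05, 3.0, -0.2, 0.0, 1.5, -0.01]
    ratio = keep_ratio_for_link(bandwidth_mbps=2500, fastest_mbps=10000)
    n, idx, kept = topk_compress(activations, ratio)
    print(idx, kept)
    print(topk_decompress(n, idx, kept))
```

Under this sketch, a link running at a quarter of the fastest bandwidth transmits a quarter of the entries along with their indices, and the receiver treats the dropped entries as zero. How the actual system chooses compression ratios per link is not specified in the abstract.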

Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster

Multi-Tenant Machine Learning Platform Based on Kubernetes

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

MR-ELM: a MapReduce-based framework for large-scale ELM training in big data era

Distributed Parallel Deep Learning of Hierarchical Extreme Learning Machine for Multimode Quality Prediction with Big Process Data

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

A Hybrid Data and Model Transfer Framework for Distributed Machine Learning

Efficient Device Scheduling with Multi-Job Federated Learning

Theano-MPI: a Theano-based Distributed Training Framework

Locality-aware and Fault-tolerant Batching for Machine Learning on Distributed Datasets

Elastic Deep Learning in Multi-Tenant GPU Clusters

Research on the Framework and Resource Scheduling Mechanisms of Hadoop YARN

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture

A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training