Abstract: Training large-scale models requires vast computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs. Building a large-scale cluster from a single type of GPU-accelerator is challenging. Constructing the cluster from multiple types of GPU-accelerators is an effective way to overcome the shortage of homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models support only homogeneous GPU-accelerators, not heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models that supports heterogeneous clusters, including AMD, Nvidia, and other types of GPU-accelerators. It introduces a distributed unified communicator to enable communication between heterogeneous GPU-accelerators, together with a distributed performance predictor and an automatic parallel planner to develop and train models efficiently on heterogeneous GPU-accelerators. Unlike distributed training systems limited to homogeneous GPU-accelerators, our system supports six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD and 640 GPU-accelerator A). The experimental results show that the best performance of our system on the heterogeneous cluster reaches up to 97.49% of the theoretical upper bound.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the challenges of using heterogeneous clusters (containing different types of GPU accelerators) for large-scale model training. Specifically, the paper points out:

1. **Communication challenges**: Different types of GPU accelerators cannot communicate directly because they use different communication libraries (for example, NVIDIA GPUs use NCCL, while other GPUs may use HCCL). This leads to communication barriers in heterogeneous clusters.
2. **Development and training challenges**: It is very difficult to design and implement the optimal distributed training strategy in a heterogeneous cluster. Because of the differences in computing and storage capabilities among different types of GPU accelerators, as well as the strong computing-communication coupling of large-scale models, the number of distributed strategies grows exponentially as the number of heterogeneous GPU accelerators, the number of model layers, or the number of operations increases.
3. **Accuracy challenges**: Differences in the precision of operations on different types of GPU accelerators make it difficult for the model to reach the accuracy level of a homogeneous cluster.

Existing distributed training systems (such as Megatron-LM, DeepSpeed, etc.) only support homogeneous clusters and do not support heterogeneous clusters. Therefore, these systems have limitations when dealing with large-scale models.

To address these problems, the paper proposes a distributed training system named HETHUB, which supports heterogeneous clusters, including AMD, NVIDIA GPUs, and other types of GPU accelerators.

### Main contributions

1. **Distributed unified communicator**: A distributed unified communicator has been constructed to support communication between different GPU accelerators. This communicator includes a CPU-based communication library (using Ethernet or IPoIB) and a GPU-based communication library (using IB or RoCE), and defines a unified communication interface to adapt to multiple types of GPU accelerators.
2. **Distributed performance predictor**: A distributed performance predictor has been proposed to evaluate the distributed training strategy of the model on a heterogeneous cluster. By performing automatic analysis on a small-scale cluster, a performance evaluation model is established, and then this model is used for performance prediction to guide the decision-making of the distributed training strategy on a large-scale cluster.
3. **Automatic parallel planner**: An automatic parallel planner has been introduced, which can automatically search for the optimal distributed parallel strategy under a given model and heterogeneous cluster topology, improving development and model computing efficiency.
4. **Performance verification**: The performance and scalability of the system have been verified using the Llama-140B model on a heterogeneous cluster containing 768 GPU accelerators (128 AMD GPU accelerators and 640 GPU accelerator A). The experimental results show that the optimal performance of the system in the heterogeneous cluster reaches 97.49% of the theoretical upper-limit performance.

### Summary

The HETHUB system provides an effective method for training large-scale models by solving the communication, development, and training challenges in heterogeneous clusters, especially in the case of limited resources. This system not only improves training efficiency but also provides new possibilities for future large-scale model training.
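To make the interplay between a performance predictor and a parallel planner more concrete, the following is a minimal sketch, not HETHUB's actual implementation. It assumes a pure pipeline-parallel setting in which each stage runs on one accelerator type, and it uses a coarse textbook estimate of iteration time, `(microbatches + stages - 1) * slowest_stage_time`. The function names (`layer_splits`, `predict_iteration_time`, `plan`) and the per-layer timings are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def layer_splits(num_layers, num_stages):
    """Yield every way to cut num_layers into num_stages contiguous, non-empty stages."""
    for cuts in combinations(range(1, num_layers), num_stages - 1):
        bounds = (0, *cuts, num_layers)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def predict_iteration_time(split, layer_time, num_microbatches):
    """Toy predictor (assumption): the slowest stage paces the whole pipeline.

    split[i]      -- number of layers assigned to stage i
    layer_time[i] -- profiled time per layer on the accelerator type of stage i
    """
    stage_times = [n * t for n, t in zip(split, layer_time)]
    return (num_microbatches + len(split) - 1) * max(stage_times)


def plan(num_layers, layer_time, num_microbatches):
    """Toy planner: exhaustively search all splits and keep the fastest predicted one."""
    return min(
        layer_splits(num_layers, len(layer_time)),
        key=lambda s: predict_iteration_time(s, layer_time, num_microbatches),
    )


# Hypothetical profile: stages 0-1 run on a faster accelerator type,
# stages 2-3 on one that is twice as slow per layer (arbitrary time units).
layer_time = [1.0, 1.0, 2.0, 2.0]

best = plan(num_layers=12, layer_time=layer_time, num_microbatches=8)
uniform = (3, 3, 3, 3)

print(best)                                            # (4, 4, 2, 2)
print(predict_iteration_time(best, layer_time, 8))     # 44.0
print(predict_iteration_time(uniform, layer_time, 8))  # 66.0
```

Even in this toy setting, 12 layers over 4 stages already give 165 candidate splits, and the count grows combinatorially with the number of layers and stages. This is why a predictor that is cheap to evaluate matters: the planner can rank candidates without running each one on the full cluster. A real planner would also need to account for tensor and data parallelism, memory limits, and communication cost, none of which are modeled here.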

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training.

Decentralized Training of Foundation Models in Heterogeneous Environments

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Efficient Large-Scale Language Model Training on GPU Clusters

Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems.

EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

AEML: An Acceleration Engine for Multi-GPU Load-balancing in Distributed Heterogeneous Environment

A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression