HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang
2024-08-09
Abstract: Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs. It is a challenge to build a large-scale cluster with one type of GPU-accelerator. Using multiple types of GPU-accelerators to construct a large-scale cluster is an effective way to solve the problem of insufficient homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models support only homogeneous GPU-accelerators, not heterogeneous ones. To address the problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models, which supports heterogeneous clusters, including AMD, Nvidia, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize communication between heterogeneous GPU-accelerators, together with a distributed performance predictor and an automatic parallel planner to develop and train models efficiently on heterogeneous GPU-accelerators. Compared to distributed training systems with homogeneous GPU-accelerators, our system can support six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD and 640 GPU-accelerator A). The experimental results show that the optimal performance of our system in the heterogeneous cluster reaches up to 97.49% of the theoretical upper-bound performance.
Distributed, Parallel, and Cluster Computing, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenges of training large-scale models on heterogeneous clusters, that is, clusters containing different types of GPU accelerators. Specifically, the paper points out:

1. **Communication challenges**: Different types of GPU accelerators cannot communicate directly because they use different communication libraries (for example, NVIDIA GPUs use NCCL, while other GPUs may use HCCL). This creates communication barriers in heterogeneous clusters.
2. **Development and training challenges**: It is very difficult to design and implement the optimal distributed training strategy in a heterogeneous cluster. Because accelerator types differ in computing and storage capability, and because computation and communication are strongly coupled in large-scale models, the number of candidate distributed strategies grows exponentially with the number of heterogeneous GPU accelerators, the number of model layers, or the number of operations.
3. **Accuracy challenges**: Differences in the numerical precision of operations across accelerator types make it difficult for the model to reach the accuracy achieved on a homogeneous cluster.

Existing distributed training systems (such as Megatron-LM and DeepSpeed) support only homogeneous clusters, which limits them when a large-scale model has to be trained on mixed hardware. To address these problems, the paper proposes a distributed training system named HETHUB, which supports heterogeneous clusters, including AMD, NVIDIA, and other types of GPU accelerators.

### Main contributions

1. **Distributed unified communicator**: A distributed unified communicator supports communication between different GPU accelerators. It includes a CPU-based communication library (using Ethernet or IPoIB) and a GPU-based communication library (using IB or RoCE), and defines a unified communication interface that adapts to multiple types of GPU accelerators.
2. **Distributed performance predictor**: A distributed performance predictor evaluates distributed training strategies for a model on a heterogeneous cluster. It profiles automatically on a small-scale cluster to build a performance evaluation model, then uses that model to predict performance and guide the choice of distributed training strategy on a large-scale cluster.
3. **Automatic parallel planner**: An automatic parallel planner searches for the optimal distributed parallel strategy for a given model and heterogeneous cluster topology, improving both development efficiency and model computing efficiency.
4. **Performance verification**: The performance and scalability of the system were verified with the Llama-140B model on a heterogeneous cluster of 768 GPU accelerators (128 AMD GPU accelerators and 640 accelerators of the type the paper calls GPU-accelerator A). The experimental results show that the optimal performance of the system in the heterogeneous cluster reaches 97.49% of the theoretical upper-bound performance.

### Summary

HETHUB provides an effective way to train large-scale models when homogeneous accelerators are in short supply, by addressing the communication, development, and training challenges of heterogeneous clusters. Beyond improving training efficiency on mixed hardware, it opens up new options for future large-scale model training.
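
The interplay between the performance predictor and the parallel planner described in the contributions can be illustrated with a small sketch. This is not HETHUB's implementation: the `Stage` class, the `predict_step_ms` cost model, and the `plan` search are all hypothetical names introduced here, and the cost model is the common textbook approximation for a synchronous pipeline (step time is roughly the number of micro-batches plus the number of stages minus one, multiplied by the slowest stage's time). It assumes each pipeline stage runs on one accelerator type whose per-layer time has already been profiled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Stage:
    """One pipeline stage on a single accelerator type (hypothetical)."""
    name: str
    layer_time_ms: float  # assumed profiled time per layer per micro-batch


def predict_step_ms(stages, layers_per_stage, micro_batches):
    """Toy predictor: (micro_batches + stages - 1) * slowest stage time."""
    stage_times = [s.layer_time_ms * n for s, n in zip(stages, layers_per_stage)]
    return (micro_batches + len(stages) - 1) * max(stage_times)


def plan(stages, total_layers, micro_batches):
    """Toy planner: try every layer split and keep the cheapest prediction.

    The search is exhaustive, so its cost grows exponentially with the
    number of stages; it is only meant for tiny illustrative inputs.
    """
    best = None
    for split in product(range(1, total_layers + 1), repeat=len(stages) - 1):
        last = total_layers - sum(split)
        if last < 1:
            continue  # every stage must hold at least one layer
        candidate = (*split, last)
        cost = predict_step_ms(stages, candidate, micro_batches)
        if best is None or cost < best[1]:
            best = (candidate, cost)
    return best


if __name__ == "__main__":
    # Two made-up accelerator types; the second is twice as slow per layer.
    stages = [Stage("fast", 1.0), Stage("slow", 2.0)]
    print(plan(stages, total_layers=12, micro_batches=4))
```

With these made-up numbers the planner assigns 8 layers to the faster stage and 4 to the slower one, so both stages take the same time per micro-batch. The brute-force loop also shows why the summary stresses that the strategy space explodes: a practical planner would need pruning or a smarter search instead of full enumeration.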