FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
2024-09-02
Abstract: Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. This paper explores an alternative approach by deploying the training computation across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. To achieve this goal, we propose a novel system, FlashFlex, that can flexibly support an asymmetric partition of the parallel training computations across the scope of data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient solution based on a hierarchical graph partitioning algorithm. Our approach can adaptively allocate asymmetric training computations across GPUs, fully leveraging the available computational power. We conduct extensive empirical studies to evaluate the performance of FlashFlex, where we find that when training LLMs at different scales (from 7B to 30B), FlashFlex running over a set of heterogeneous GPUs can achieve training MFU comparable to that of state-of-the-art training systems running over a set of homogeneous high-performance GPUs with the same total peak FLOPS. The smallest gaps in MFU achieved are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with or without RDMA. Our implementation is available at https://github.com/Relaxed-System-Lab/FlashFlex.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper targets two linked problems in large language model (LLM) training: inefficient use of available computing resources and high cost. LLM training usually depends on homogeneous high-performance GPU clusters in data centers, which limits the flexibility of training jobs and makes deployment expensive. To address these issues, the paper proposes a new system, FlashFlex, which aims to improve the flexibility and efficiency of LLM training in the following ways:

- **Distributed training in heterogeneous environments**: Deploy parallel training computations across heterogeneous GPUs to make better use of existing hardware, reduce cost, and increase resource utilization.
- **Asymmetric partitioning strategy**: Support asymmetric partitioning across data parallelism, pipeline parallelism, and tensor model parallelism, so that the workload can adapt to GPUs with different capabilities.
- **Optimized scheduling algorithm**: Formalize the allocation of training computation over heterogeneous GPUs as a constrained optimization problem, and solve it efficiently with a hierarchical graph partitioning algorithm so that training can fully use the computing power of each GPU.

With these techniques, FlashFlex running on heterogeneous GPUs achieves training MFU comparable to that of state-of-the-art systems running on homogeneous GPUs with the same total peak FLOPS. For LLMs at the 7B to 30B scale, the smallest MFU gaps are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with RDMA.
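To make the asymmetric partitioning idea and the MFU metric above more concrete, here is a minimal Python sketch. It is not FlashFlex's algorithm (the paper uses hierarchical graph partitioning over a constrained optimization problem). It only illustrates the simpler intuition of giving faster pipeline stages more layers. The function names `split_layers` and `mfu`, the throughput figures, and the "6 FLOPs per parameter per token" estimate are assumptions chosen for illustration.

```python
from typing import List


def split_layers(num_layers: int, stage_tflops: List[float]) -> List[int]:
    """Assign layers to pipeline stages in proportion to each stage's peak TFLOPS.

    Hypothetical helper: uses largest-remainder rounding and guarantees
    that every stage receives at least one layer.
    """
    if num_layers < len(stage_tflops):
        raise ValueError("need at least one layer per stage")
    total = sum(stage_tflops)
    # Ideal (fractional) share of layers for each stage.
    ideal = [num_layers * t / total for t in stage_tflops]
    counts = [max(1, int(x)) for x in ideal]
    # Give leftover layers to the stages furthest below their ideal share.
    while sum(counts) < num_layers:
        i = max(range(len(counts)), key=lambda k: ideal[k] - counts[k])
        counts[i] += 1
    # If the one-layer minimum overshot, take layers back from the
    # stages furthest above their ideal share.
    while sum(counts) > num_layers:
        candidates = [k for k in range(len(counts)) if counts[k] > 1]
        i = min(candidates, key=lambda k: ideal[k] - counts[k])
        counts[i] -= 1
    return counts


def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Rough model FLOPS utilization.

    Assumes about 6 FLOPs per parameter per token for a forward plus
    backward pass, a common approximation that ignores attention cost.
    """
    return 6.0 * params * tokens_per_sec / peak_flops


if __name__ == "__main__":
    # Four pipeline stages with illustrative peak throughputs in TFLOPS.
    print(split_layers(32, [312.0, 312.0, 125.0, 125.0]))
    # A 7B-parameter model at 1000 tokens/s on 84 TFLOPS of total peak compute.
    print(mfu(7e9, 1000.0, 8.4e13))
```

A proportional split like this balances only compute. A real allocator, as the summary notes, also has to respect constraints such as memory capacity and interconnect bandwidth, which is why the paper frames allocation as a constrained optimization problem.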
In summary, the paper's main objective is to address underused hardware and excessive cost in current LLM training by introducing the FlashFlex system, thereby making LLM training more widely accessible.