Abstract: Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. This paper explores an alternative approach by deploying the training computation across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. To achieve this goal, we propose a novel system, FlashFlex, that can flexibly support an asymmetric partition of the parallel training computations across the scope of data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient solution based on a hierarchical graph partitioning algorithm. Our approach can adaptively allocate asymmetric training computations across GPUs, fully leveraging the available computational power. We conduct extensive empirical studies to evaluate the performance of FlashFlex, where we find that when training LLMs at different scales (from 7B to 30B), FlashFlex can achieve comparable training MFU when running over a set of heterogeneous GPUs compared with the state-of-the-art training systems running over a set of homogeneous high-performance GPUs with the same amount of total peak FLOPS. The achieved smallest gaps in MFU are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with or without RDMA. Our implementation is available at https://github.com/Relaxed-System-Lab/FlashFlex.

What problem does this paper attempt to address?

This paper addresses the inefficient use of computing resources and the high cost of training large language models (LLMs). LLM training usually depends on clusters of homogeneous high-performance GPUs inside data centers, which limits the flexibility of training jobs and leads to high deployment costs. To address these issues, the paper proposes a new system, FlashFlex, which aims to improve the flexibility and efficiency of LLM training in the following ways:

- **Distributed training in heterogeneous environments**: Deploy parallel training computations across heterogeneous GPUs to make better use of existing hardware, reduce costs, and increase resource utilization.
- **Asymmetric partitioning strategy**: Support asymmetric partitioning across data parallelism, pipeline parallelism, and tensor model parallelism, so that the workload can adapt to GPUs with different performance.
- **Optimized scheduling algorithm**: Formalize the allocation of training computation over heterogeneous GPUs as a constrained optimization problem, and solve it with a hierarchical graph partitioning algorithm so that training can fully use the computing capability of each GPU.

With these techniques, FlashFlex running on heterogeneous GPUs achieves a training MFU comparable to state-of-the-art training systems running on homogeneous high-performance GPUs with the same total peak FLOPS. For LLMs from 7B to 30B parameters, the smallest MFU gaps are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with RDMA or not.
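The MFU comparison above can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (attention FLOPs ignored), and the throughput and per-GPU peak numbers are illustrative placeholders. The function names `model_flops_per_token` and `mfu` are hypothetical.

```python
def model_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token (assumption: 6 * parameters)."""
    return 6.0 * n_params


def mfu(n_params: float, tokens_per_second: float, peak_flops_per_gpu: list[float]) -> float:
    """Model FLOPS utilization: achieved model FLOPS divided by total peak FLOPS."""
    achieved = model_flops_per_token(n_params) * tokens_per_second
    total_peak = sum(peak_flops_per_gpu)
    return achieved / total_peak


# Illustrative heterogeneous cluster: two faster and two slower GPUs.
# All numbers here are placeholders, not measurements from the paper.
peaks = [312e12, 312e12, 125e12, 125e12]
value = mfu(n_params=7e9, tokens_per_second=8000, peak_flops_per_gpu=peaks)
print(f"MFU = {value:.1%}")  # prints "MFU = 38.4%"
```

Because the denominator is the sum of peak FLOPS over all GPUs, the same formula applies to a homogeneous and a heterogeneous cluster, which is what makes a comparison at equal total peak FLOPS meaningful.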
In summary, the main objective of this paper is to address insufficient resource utilization and excessive cost in current LLM training by introducing the FlashFlex system, making LLM training feasible on a wider range of hardware.
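To illustrate what an asymmetric partition means for pipeline parallelism, the sketch below assigns transformer layers to pipeline stages in proportion to each stage's peak FLOPS. This is a deliberately simple heuristic for illustration only; it is not the hierarchical graph partitioning algorithm the paper proposes, and it ignores memory limits and communication bandwidth. The function name `split_layers` and the FLOPS values are hypothetical.

```python
import math


def split_layers(num_layers: int, stage_flops: list[float]) -> list[int]:
    """Assign layers to stages proportionally to stage compute (largest-remainder rounding)."""
    total = sum(stage_flops)
    # Ideal (fractional) share of layers for each stage.
    ideal = [num_layers * f / total for f in stage_flops]
    counts = [math.floor(x) for x in ideal]
    leftover = num_layers - sum(counts)
    # Hand remaining layers to the stages with the largest fractional parts.
    by_fraction = sorted(range(len(ideal)), key=lambda i: counts[i] - ideal[i])
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Illustrative cluster: two faster and two slower stages (placeholder values).
print(split_layers(32, [312e12, 312e12, 125e12, 125e12]))  # prints [11, 11, 5, 5]
```

A symmetric split would give each of the four stages 8 layers, leaving the faster stages idle while the slower ones finish. The proportional split gives the faster stages more work, which is the intuition behind asymmetric partitioning.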

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

High-throughput Generative Inference of Large Language Models with a Single GPU

Flextron: Many-in-One Flexible Large Language Model

FlashDecoding++: Faster Large Language Model Inference on GPUs

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Optimizing Distributed Training on Frontier for Large Language Models

FLASH: Heterogeneity-Aware Federated Learning at Scale

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

FlexModel: A Framework for Interpretability of Distributed Large Language Models

Efficient Large-Scale Language Model Training on GPU Clusters

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models