Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

2024-06-04

Abstract: This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7$\times$ and reduces prompting and decoding latency by up to 2.8$\times$ and 1.3$\times$, respectively, compared to best existing approaches.

Distributed, Parallel, and Cluster Computing; Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of efficiently serving large language models (LLMs) on heterogeneous GPU clusters. It proposes Helix, whose core idea is to model LLM execution on a heterogeneous GPU cluster as a max-flow problem and to search for a highly optimized serving strategy with a mixed-integer linear programming (MILP) algorithm.

Helix aims to overcome the following challenges:

1. **Model Partitioning and Device Allocation**: As LLMs grow, a single GPU can no longer meet their storage and computation demands, so the model must be split across multiple GPUs. Traditional uniform partitioning cannot fully utilize the more powerful GPUs in a mixed cluster. Helix optimizes how the model is partitioned and placed across different GPUs by modeling the problem as max-flow.
2. **Request Scheduling**: To handle real-time inference requests efficiently, Helix introduces per-request pipelines instead of fixed pipelines, which adapt better to heterogeneous compute and network conditions.

Key contributions of Helix include:

- The first LLM serving system designed for heterogeneous GPU clusters that achieves high throughput and low latency.
- Modeling the LLM serving problem as a max-flow problem and using a MILP algorithm to optimize model placement.
- Introducing dynamic, per-request pipelines to maximize GPU utilization.
- Implementing and evaluating the technique, demonstrating significant performance improvements across various LLM benchmarks.

The paper experimentally validates Helix on heterogeneous clusters of different scales. Compared to existing heterogeneity-aware baselines, Helix significantly improves serving throughput while reducing average prompt and decoding latency.
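The max-flow view described above can be illustrated with a small sketch. This is not Helix's implementation: the cluster, the GPU names, and every capacity number are hypothetical, the model placement is fixed by hand rather than found by MILP, and a textbook Edmonds-Karp routine stands in for whatever solver the system actually uses. Each GPU is split into an `_in` and an `_out` node so that its compute throughput becomes an edge capacity, while network links between GPUs are ordinary edges.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths until none remain."""
    # residual[u][v] is the remaining capacity on edge u -> v.
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), cap in capacity.items():
        residual[u][v] += cap
        residual[v][u] += 0  # ensure the reverse edge exists

    total = 0
    while True:
        # Breadth-first search for an augmenting path from source to sink.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Update residual capacities along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical cluster: one fast GPU holds the first half of the layers,
# two slower GPUs each hold a replica of the second half.
# All capacities are made-up throughput numbers (tokens per second).
edges = {
    ("source", "fast_in"): 1000,       # requests enter at the first stage
    ("fast_in", "fast_out"): 100,      # compute capacity of the fast GPU
    ("fast_out", "slow_a_in"): 50,     # network link fast -> slow_a
    ("fast_out", "slow_b_in"): 20,     # network link fast -> slow_b
    ("slow_a_in", "slow_a_out"): 30,   # compute capacity of slow_a
    ("slow_b_in", "slow_b_out"): 30,   # compute capacity of slow_b
    ("slow_a_out", "sink"): 1000,      # finished tokens leave the cluster
    ("slow_b_out", "sink"): 1000,
}

throughput = max_flow(edges, "source", "sink")
print(throughput)  # 50
```

In this toy cluster the flow is limited to 30 through `slow_a` (compute-bound) and 20 through `slow_b` (network-bound), so the maximum flow of 50 is the serving throughput this particular placement can sustain. Comparing that number across candidate placements is the kind of objective a placement search could optimize.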

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

High-throughput Generative Inference of Large Language Models with a Single GPU

Enhanced Hybrid Hierarchical Federated Edge Learning Over Heterogeneous Networks

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

NanoFlow: Towards Optimal Large Language Model Serving Throughput

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Efficient LLM inference solution on Intel GPU

Efficient Large-Scale Language Model Training on GPU Clusters

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Efficient Deployment of Large Language Model Across Cloud-Device Systems

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Fast Distributed Inference Serving for Large Language Models