Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak
2024-06-04
Abstract: This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7$\times$ and reduces prompting and decoding latency by up to 2.8$\times$ and 1.3$\times$, respectively, compared to the best existing approaches.
Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of efficiently serving large language models (LLMs) on heterogeneous GPU clusters. It proposes the Helix system, whose core idea is to model LLM inference over heterogeneous GPUs and network connections as a max-flow problem and to search for a highly optimized serving strategy with a mixed-integer linear programming (MILP) algorithm. Helix aims to overcome the following challenges:

1. **Model Partitioning and Device Allocation**: As LLMs grow, a single GPU can no longer meet their storage and computation demands, so the model must be split across multiple GPUs. Traditional uniform partitioning cannot fully utilize the capabilities of high-performance GPUs. Helix optimizes how the model is partitioned and placed on the different GPUs by modeling the choice as a max-flow problem.
2. **Request Scheduling**: To handle real-time inference requests efficiently, Helix uses per-request pipelines instead of fixed pipelines, which adapt better to heterogeneous compute and network conditions.

Key contributions of Helix include:

- The first design of an LLM serving system for heterogeneous GPU clusters that achieves high throughput and low latency.
- Modeling the LLM serving problem as a max-flow problem and using an MILP algorithm to optimize model placement.
- Introducing a dynamic pipeline for each request to maximize GPU utilization.
- Implementing and evaluating the technique, demonstrating significant performance improvements across LLM benchmarks.

The paper validates Helix experimentally on heterogeneous clusters of 24 to 42 GPU nodes. Compared to the best existing approaches, Helix improves serving throughput by up to 2.7$\times$ and reduces prompting and decoding latency by up to 2.8$\times$ and 1.3$\times$, respectively.
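
To make the max-flow framing concrete, here is a minimal Python sketch on a toy cluster. It is an illustration under stated assumptions, not Helix's implementation: the GPU names, the fixed two-stage model placement, and every capacity (in tokens per second) are invented for the example, and the MILP search over placements is not shown. Each GPU is split into an `_in` and an `_out` node so that the edge between them carries the GPU's compute capacity, while edges between GPUs carry network capacity. The maximum flow from `source` to `sink` is then an upper bound on the serving throughput of this one placement.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow; `capacity` maps u -> {v: edge capacity}."""
    # Residual graph: forward edges plus zero-capacity reverse edges.
    residual = {}
    for u, neighbors in capacity.items():
        residual.setdefault(u, {})
        for v, cap in neighbors.items():
            residual[u][v] = residual[u].get(v, 0) + cap
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical placement: a100 and t4_a hold the first half of the layers,
# t4_b and l4 hold the second half. All numbers are made up (tokens/s).
cluster = {
    "source":   {"a100_in": 1000, "t4_a_in": 1000},  # coordinator links
    "a100_in":  {"a100_out": 900},                   # A100 compute capacity
    "t4_a_in":  {"t4_a_out": 300},                   # T4 compute capacity
    "a100_out": {"t4_b_in": 600, "l4_in": 100},      # 100 = slow network link
    "t4_a_out": {"l4_in": 1000},
    "t4_b_in":  {"t4_b_out": 400},                   # T4 compute capacity
    "l4_in":    {"l4_out": 500},                     # L4 compute capacity
    "t4_b_out": {"sink": 1000},
    "l4_out":   {"sink": 1000},
}

throughput = max_flow(cluster, "source", "sink")
print(throughput)  # prints 800
```

In this toy graph the A100 could process 900 tokens/s, but only 500 tokens/s can leave it usefully: 400 through the slower `t4_b` and 100 over the slow link to `l4`. That is the kind of joint compute and network bottleneck the max-flow view exposes. A different placement of layers yields a different graph and a different flow value, which is why placement has to be optimized together with scheduling.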