HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan

2024-05-27

Abstract: Serving generative inference of the large language model is a crucial component of contemporary AI applications. This paper focuses on deploying such services in a heterogeneous and cross-datacenter setting to mitigate the substantial inference costs typically associated with a single centralized datacenter. Towards this end, we propose HexGen, a flexible distributed inference engine that uniquely supports the asymmetric partition of generative inference computations over both tensor model parallelism and pipeline parallelism and allows for effective deployment across diverse GPUs interconnected by a fully heterogeneous network. We further propose a sophisticated scheduling algorithm grounded in constrained optimization that can adaptively assign asymmetric inference computation across the GPUs to fulfill inference requests while maintaining acceptable latency levels. We conduct an extensive evaluation to verify the efficiency of HexGen by serving the state-of-the-art Llama-2 (70B) model. The results suggest that HexGen can choose to achieve up to 2.3 times lower latency deadlines or tolerate up to 4 times more request rates compared with the homogeneous baseline given the same budget.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses the problem of deploying large language model (LLM) inference services in heterogeneous environments, including across data centers. Specifically:

1. **Reducing Costs**:
   - Current state-of-the-art LLM inference services are typically deployed within a single centralized data center and require high-end homogeneous GPU clusters, leading to high costs.
   - The paper proposes deploying inference services in heterogeneous environments to reduce inference costs.
2. **Optimizing Scheduling Algorithms**:
   - Deploying LLM inference services in heterogeneous environments faces numerous challenges, such as varying GPU computational capabilities and network connections.
   - The paper introduces a distributed inference engine named HexGen, which supports asymmetric partitioning over both tensor model parallelism and pipeline parallelism. It also designs a scheduling algorithm based on constrained optimization that adapts to different GPU computational capabilities and network connections.
3. **Experimental Validation**:
   - Extensive experimental evaluations validate the effectiveness of HexGen, particularly when compared to homogeneous baselines, demonstrating lower latency or higher request handling capacity under the same budget.

In summary, the paper aims to improve the efficiency and cost-effectiveness of LLM inference services by leveraging heterogeneous resources, thereby reducing overall costs and enhancing service quality.
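To make the idea of an asymmetric partition more concrete, the following is a minimal sketch and not HexGen's actual scheduler. The paper describes a constrained-optimization algorithm; this example substitutes a simple proportional heuristic. It assumes each pipeline stage is a tensor-parallel group of GPUs and splits the model's layers across stages in proportion to each group's aggregate compute, so stages may differ in both tensor-parallel degree and layer count. The names `Stage` and `partition_layers`, and the throughput figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage: a group of GPUs sharing a tensor-parallel split."""
    name: str
    tp_degree: int         # number of GPUs in the tensor-parallel group
    tflops_per_gpu: float  # hypothetical per-GPU throughput figure


def partition_layers(stages, num_layers):
    """Assign layers to stages in proportion to aggregate stage compute.

    Every stage gets at least one layer; the rest are distributed with
    largest-remainder rounding so the counts sum to num_layers.
    """
    if num_layers < len(stages):
        raise ValueError("need at least one layer per stage")
    capacity = [s.tp_degree * s.tflops_per_gpu for s in stages]
    total = sum(capacity)
    spare = num_layers - len(stages)  # layers left after one per stage
    ideal = [spare * c / total for c in capacity]
    counts = [1 + int(x) for x in ideal]
    leftover = num_layers - sum(counts)
    # Give the remaining layers to the stages with the largest fractions.
    order = sorted(range(len(stages)),
                   key=lambda i: ideal[i] - int(ideal[i]),
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Illustrative, made-up cluster: three stages with different GPU groups.
    stages = [
        Stage("fast-group", tp_degree=4, tflops_per_gpu=100.0),
        Stage("mid-group", tp_degree=2, tflops_per_gpu=50.0),
        Stage("small-group", tp_degree=1, tflops_per_gpu=50.0),
    ]
    for stage, n in zip(stages, partition_layers(stages, num_layers=80)):
        print(f"{stage.name}: TP={stage.tp_degree}, layers={n}")
```

A real scheduler would also need to account for GPU memory limits, network bandwidth between stages, and latency targets, which is where the constrained-optimization formulation mentioned in the summary comes in; this heuristic ignores all of those.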

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

High-throughput Generative Inference of Large Language Models with a Single GPU

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Decentralized Training of Foundation Models in Heterogeneous Environments

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Elixir: Train a Large Language Model on a Small GPU Cluster

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units.

HUGE2: a Highly Untangled Generative-model Engine for Edge-computing

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Inference Performance Optimization for Large Language Models on CPUs

The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Fast Distributed Inference Serving for Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Efficient Large-Scale Language Model Training on GPU Clusters