Abstract: Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are largely served using uniform high-caliber GPUs nowadays, utilizing a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially substantially reduce the serving cost. There is a lack of designs to support efficient LLM serving using a heterogeneous cluster, while the current solutions focus on model partition and uniform compression among homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. We carefully decide on mixed-precision model quantization together with phase-aware model partition and micro-batch sizing in distributed LLM serving with an efficient algorithm, to greatly enhance inference throughput while fulfilling user-specified model quality targets. Extensive experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference, showing great advantages over state-of-the-art works.

What problem does this paper attempt to address?

This paper addresses the challenge of serving large language models (LLMs) efficiently on heterogeneous GPU clusters. It points out that current solutions focus on model partitioning and uniform compression across homogeneous devices, and that no existing design effectively supports LLM serving on heterogeneous clusters. Such clusters contain GPUs of different models and capacities, and ignoring these differences leads to low resource utilization and higher cost. The paper therefore proposes LLM-PQ, a system that improves LLM serving efficiency on heterogeneous GPU clusters through adaptive model quantization and phase-aware partitioning.

### Main Problem Points:

1. **Resource Utilization and Cost Problems**:
   - Running large-scale language models requires a large amount of GPU resources and is costly.
   - Current solutions mainly target homogeneous clusters and cannot fully utilize the low-capacity GPUs in a heterogeneous cluster, which wastes resources and raises cost.
2. **Model Partitioning and Quantization Problems**:
   - Existing model partitioning methods (such as tensor parallelism and pipeline parallelism) usually assume homogeneous devices. In a heterogeneous cluster this can leave high-capacity GPUs underutilized or cause out-of-memory (OOM) errors on low-memory GPUs.
   - Uniform model quantization (such as INT4 on every device) may waste memory on high-capacity GPUs, while on low-capacity GPUs it may still be insufficient to avoid OOM errors.
3. **Complexity of Generative Inference**:
   - Generative inference in LLMs consists of two phases: prefill and decode. The execution time and resource requirements of the two phases differ significantly, and the difference is more pronounced in heterogeneous clusters.
   - Existing solutions mainly focus on single-phase models (such as encoder-based Transformers) and cannot be applied directly to the two-phase generative inference of LLMs.

### Solutions:

1. **Adaptive Mixed-Precision Quantization**:
   - LLM-PQ selects the quantization precision according to the memory and compute capabilities of each GPU, which avoids memory waste and improves model quality.
2. **Phase-Aware Partitioning**:
   - To make better use of the resources in a heterogeneous cluster, LLM-PQ partitions the model according to the characteristics of the prefill and decode phases, so that the execution time of both phases is taken into account.
3. **Micro-batch Sizing**:
   - LLM-PQ also chooses micro-batch sizes together with the quantization and partitioning decisions to further improve inference throughput.

### Experimental Results:

- LLM-PQ was evaluated on production inference workloads in 11 different clusters. It improves inference throughput by up to 2.88x (2.26x on average) over state-of-the-art methods.

### Summary:

The paper tackles the problem of efficiently serving large language models on heterogeneous GPU clusters by proposing the LLM-PQ system. By combining adaptive mixed-precision quantization, phase-aware partitioning, and micro-batch sizing, it improves resource utilization and inference throughput, which can in turn reduce serving cost.
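To make the contrast between uniform and per-GPU quantization concrete, here is a minimal Python sketch. It is not the algorithm from the paper. It assumes a fixed even layer split with one pipeline stage per GPU, counts only weight memory (no KV cache or activations), and ignores the prefill and decode phases entirely. The `GPU` class, the function names, the memory budgets, and the model size are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Candidate weight precisions in bits, highest (best quality) first.
BITS = (16, 8, 4)


@dataclass
class GPU:
    name: str
    weight_mem_gb: float  # hypothetical memory budget reserved for weights


def stage_gb(n_layers: int, params_b: float, bits: int) -> float:
    """Weight memory in GB for n_layers layers of params_b billion parameters each."""
    return n_layers * params_b * bits / 8


def highest_fitting_bits(gpu, n_layers, params_b):
    """Highest precision whose weights fit on this GPU, or None if even 4-bit overflows."""
    for bits in BITS:
        if stage_gb(n_layers, params_b, bits) <= gpu.weight_mem_gb:
            return bits
    return None


def adaptive_bits(gpus, total_layers, params_b):
    """Per-GPU precision under an even layer split (one pipeline stage per GPU)."""
    per_stage = total_layers // len(gpus)  # assumes total_layers divides evenly
    return [highest_fitting_bits(g, per_stage, params_b) for g in gpus]


def uniform_bits(gpus, total_layers, params_b):
    """Single precision that fits on every GPU, or None if some GPU fits none."""
    choices = adaptive_bits(gpus, total_layers, params_b)
    return None if None in choices else min(choices)


# Hypothetical cluster: one high-capacity and one low-capacity GPU,
# serving a 48-layer model with 0.5B parameters per layer.
gpus = [GPU("large", 40.0), GPU("small", 16.0)]
print(adaptive_bits(gpus, 48, 0.5))  # [16, 8]
print(uniform_bits(gpus, 48, 0.5))   # 8
```

Under these assumed numbers, a uniform choice forces 8-bit weights everywhere, so the larger GPU holds 12 GB of weights in a 40 GB budget, while the per-GPU choice lets it keep 16-bit weights. A system such as the one summarized above would additionally need to decide the layer split and micro-batch size and to account for both inference phases, none of which this sketch attempts.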

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Efficient Deployment of Large Language Model Across Cloud-Device Systems

One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

SqueezeLLM: Dense-and-Sparse Quantization

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

COMET: Towards Partical W4A4KV4 LLMs Serving

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Progressive Mixed-Precision Decoding for Efficient LLM Inference

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

SLoB: Suboptimal Load Balancing Scheduling in Local Heterogeneous GPU Clusters for Large Language Model Inference

EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM

A System for Microserving of LLMs