Abstract: Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly dynamic behavior of the attention layers, showcasing significant differences in computational characteristics and memory requirements from the non-attention layers. This presents substantial challenges for resource management and performance optimization in service systems. Existing static model parallelism and resource allocation strategies fall short when dealing with this dynamicity. To address the issue, we propose Infinite-LLM, a novel LLM serving system designed to effectively handle dynamic context lengths. Infinite-LLM disaggregates attention layers from an LLM's inference process, facilitating flexible and independent resource scheduling that optimizes computational performance and enhances memory utilization jointly. By leveraging a pooled GPU memory strategy across a cluster, Infinite-LLM not only significantly boosts system throughput but also supports extensive context lengths. Evaluated on a dataset with context lengths ranging from a few to 2000K tokens across a cluster with 32 A100 GPUs, Infinite-LLM demonstrates throughput improvement of 1.35-3.4x compared to state-of-the-art methods, enabling efficient and elastic LLM deployment.
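
To make the abstract's point about memory requirements concrete, the sketch below estimates KV-cache size as a function of context length for a standard attention layout. The model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte values) are hypothetical defaults chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # hypothetical model depth
    num_kv_heads: int = 32,    # hypothetical number of key/value heads
    head_dim: int = 128,       # hypothetical per-head dimension
    bytes_per_value: int = 2,  # e.g. a 16-bit floating-point format
) -> int:
    """Estimate KV-cache size in bytes for one request.

    Each token stores one key and one value vector (the factor of 2)
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    for tokens in (1_000, 100_000, 2_000_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>9} tokens -> {gib:,.2f} GiB of KV cache")
```

Under these assumed dimensions the cache grows linearly with context length while the model weights stay fixed, which is why a single long request can outgrow the memory of the instance that serves it.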

What problem does this paper attempt to address?

The paper targets the resource management and performance optimization challenges that arise when serving Large Language Models (LLMs) on long-context workloads. As LLMs evolve, supported context lengths keep increasing, which makes resource demands in serving systems highly dynamic. Traditional static model-parallel strategies and resource allocation methods struggle to cope with these changes. Key issues raised in the paper include:

1. **Inefficient model parallelism**: Long and short requests call for significantly different model-parallel strategies. Traditional methods use a fixed degree of parallelism, which cannot adapt to the differing needs of long and normal-length texts.
2. **Inefficient resource management across instances**: Because request lengths are not known in advance, pre-allocating resources is impractical, and compute and memory demands vary widely across instances. For example, one instance may handle a large number of short requests while another encounters memory-intensive long requests.

To address these issues, the paper proposes the Infinite-LLM system. Its main contributions include:

- Revealing the dynamic characteristics of LLM request serving and identifying the limitations of existing static model-parallel deployment and KV-Cache scheduling.
- Introducing DistAttention, a novel attention mechanism that can flexibly separate attention computation and the KV-Cache in a distributed environment.
- Designing the Infinite-LLM system, which manages highly dynamic context lengths and balances resource demands between instances through cluster-level KV-Cache scheduling, thereby achieving high overall system throughput.
- Showing experimentally that Infinite-LLM can process contexts of up to 2000K tokens on 32 A100 GPUs, improving end-to-end performance by 1.35-3.4x over state-of-the-art LLM serving systems.

In summary, Infinite-LLM aims to improve the serving efficiency and resource utilization of large language models on long-context workloads through dynamic resource allocation and optimized handling of the attention layers.
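
The idea of separating attention computation from where the KV-Cache lives can be illustrated with a small sketch. It relies on a standard softmax identity: attention over a KV-Cache split into blocks can be computed per block and then merged exactly from each block's maximum score, exponential sum, and unnormalized output. This is an illustration of that general identity under simplifying assumptions (one query vector, plain Python lists, blocks processed in one process), not the paper's implementation; all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Partial = Tuple[float, float, Vector]  # (max score, sum of exps, unnormalized output)


def _scores(q: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product scores of one query against a set of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def full_attention(q, keys, values) -> Vector:
    """Reference: softmax attention over the whole KV-Cache at once."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def partial_attention(q, keys, values) -> Partial:
    """Attention over one KV block, as a worker holding that block could compute it."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), acc


def merge_partials(partials: Sequence[Partial]) -> Vector:
    """Combine per-block results into the exact full-attention output."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, exp_sum, acc in partials:
        factor = math.exp(m - global_max)  # rescale to the shared maximum
        total += exp_sum * factor
        for i, x in enumerate(acc):
            out[i] += x * factor
    return [x / total for x in out]


def distributed_attention(q, keys, values, block_size: int) -> Vector:
    """Split the KV-Cache into blocks, attend per block, then merge."""
    partials = [
        partial_attention(q, keys[i:i + block_size], values[i:i + block_size])
        for i in range(0, len(keys), block_size)
    ]
    return merge_partials(partials)
```

In a distributed setting each call to `partial_attention` would run wherever its KV block is stored, and only the query plus three small per-block results would cross the network, rather than the cached keys and values themselves.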

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
