Abstract: Efficient management of GPU memory is essential for high-throughput LLM inference. Prior systems reserved KV-cache memory ahead of time, which wasted capacity through internal fragmentation. Inspired by demand paging, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and improves serving throughput. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. As a consequence, one must rewrite the attention kernels to support paging and implement a memory manager in the serving framework. This results in both performance and programming overheads, as well as portability challenges in adopting state-of-the-art attention kernels. In this paper, we propose vAttention, a new approach for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention stores the KV-cache in contiguous virtual memory and leverages OS support for on-demand allocation of physical memory. vAttention thus enables one to use state-of-the-art attention kernels out of the box by adding support for dynamic allocation of physical memory without having to rewrite their code. We implement vAttention in the vLLM serving stack and show that it also improves decode throughput by up to 1.99x over vLLM, and end-to-end serving throughput by up to 1.22x and 1.29x compared to using the state-of-the-art PagedAttention-based kernels of FlashAttention and FlashInfer, respectively.

What problem does this paper attempt to address?

This paper addresses the problem of managing GPU memory efficiently to improve throughput during large language model (LLM) inference. Specifically, it focuses on dynamically managing the GPU memory that holds the key-value cache (KV-cache) used by the attention mechanism.

### Background and Problem

1. **Issues with Static Memory Allocation**:
   - Earlier systems (such as Orca and FasterTransformer) pre-allocate GPU memory based on the maximum context length supported by the model. Because the number of tokens a request actually generates is usually much smaller than the maximum context length, this causes severe internal fragmentation, which limits batch size and serving throughput.
2. **Limitations of PagedAttention**:
   - PagedAttention alleviates fragmentation by allocating memory dynamically, but it changes the layout of the KV-cache in virtual memory from contiguous to non-contiguous. This requires rewriting the attention kernels and implementing a memory manager in the serving framework, which adds performance overhead and programming complexity.
   - Experimental results show that PagedAttention degrades performance in multiple systems (such as vLLM, FlashAttention, FlashInfer, and TensorRT-LLM).

### Solution

The paper proposes vAttention, a new method that keeps the KV-cache contiguous in virtual memory while still allocating physical memory dynamically. The main features of vAttention include:

1. **Leveraging Operating System Support**:
   - vAttention uses the operating system's virtual memory and on-demand paging capabilities instead of implementing paging in user space. This avoids rewriting the attention kernels and allows state-of-the-art attention kernels to be used directly.
2. **Efficient Memory Management**:
   - vAttention pre-allocates a large contiguous buffer in virtual memory but allocates physical memory dynamically at runtime. This retains the contiguity of virtual memory while avoiding physical memory waste.
   - By modifying the open-source CUDA unified virtual memory driver, it supports finer-grained physical memory allocation (64KB), reducing fragmentation and waste.
3. **Optimization Measures**:
   - Overlapping memory allocation with computation, pre-allocating pages, and delaying memory reclamation hide the latency of memory allocation and improve the efficiency of vAttention.

### Experimental Results

The paper validates vAttention experimentally. It improves decode throughput by up to 1.99x over vLLM, and end-to-end serving throughput by up to 1.22x and 1.29x over the PagedAttention-based kernels of FlashAttention and FlashInfer, respectively.

### Conclusion

By relying on the virtual memory and on-demand paging capabilities supported by the operating system, vAttention provides a simpler dynamic KV-cache memory management method that also improves LLM inference throughput over PagedAttention-based systems.
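The idea of reserving contiguous virtual memory and backing it with physical pages only on demand can be illustrated with a small simulation. This is a toy sketch, not the paper's implementation: it makes no GPU or driver calls, collapses the whole KV-cache of one request into a single buffer, and the class `VirtualKVBuffer`, the helper `static_waste_fraction`, and the 128KB per-token footprint are all hypothetical choices made for illustration. Only the 64KB page size comes from the summary above.

```python
from dataclasses import dataclass

PAGE_SIZE = 64 * 1024  # bytes; the 64KB granularity mentioned in the summary


@dataclass
class VirtualKVBuffer:
    """Toy model: a contiguous virtual range whose physical pages are mapped lazily."""

    bytes_per_token: int   # assumed per-token KV-cache footprint (hypothetical)
    max_tokens: int        # context length the virtual range is sized for
    tokens: int = 0        # tokens currently stored
    mapped_pages: int = 0  # physical pages currently backing the range

    @property
    def virtual_bytes(self) -> int:
        # Reserved up front; in this model it costs address space only.
        return self.bytes_per_token * self.max_tokens

    @property
    def physical_bytes(self) -> int:
        return self.mapped_pages * PAGE_SIZE

    def append_tokens(self, n: int) -> int:
        """Store n more tokens; return how many new pages had to be mapped."""
        if n < 0 or self.tokens + n > self.max_tokens:
            raise ValueError("token count outside the reserved virtual range")
        self.tokens += n
        used = self.tokens * self.bytes_per_token
        needed = (used + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division
        new_pages = max(0, needed - self.mapped_pages)
        self.mapped_pages += new_pages
        return new_pages


def static_waste_fraction(max_tokens: int, actual_tokens: int) -> float:
    """Fraction of a max-context reservation left unused by one request."""
    return 1 - actual_tokens / max_tokens


if __name__ == "__main__":
    # Hypothetical request: 1000-token prompt followed by 24 decode steps.
    buf = VirtualKVBuffer(bytes_per_token=128 * 1024, max_tokens=32768)
    buf.append_tokens(1000)
    for _ in range(24):
        buf.append_tokens(1)
    print("virtual bytes reserved:", buf.virtual_bytes)
    print("physical bytes mapped: ", buf.physical_bytes)
    print("waste under static allocation:", static_waste_fraction(32768, 1024))
```

Because the virtual range never moves, a kernel written for a contiguous KV-cache could index this buffer unchanged, while physical usage tracks the tokens actually stored rather than the maximum context length. In this model a finer page size also reduces the unused tail of the last mapped page.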

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Efficient Memory Management for Large Language Model Serving with PagedAttention

Open-AI model Efficient Memory Reduce Management for the Large Language Models (LLMs) Serving with Paged Attention of sharing the KV Cashes

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Beyond KV Caching: Shared Attention for Efficient LLMs

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

In-context KV-Cache Eviction for LLMs via Attention-Gate

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression