vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
2024-07-12
Abstract: Efficient management of GPU memory is essential for high-throughput LLM inference. Prior systems reserved KV-cache memory ahead of time, which wasted capacity due to internal fragmentation. Inspired by demand paging, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and improves serving throughput. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. As a consequence, one needs to rewrite the attention kernels to support paging and implement a memory manager in the serving framework. This results in both performance and programming overheads, as well as portability challenges in adopting state-of-the-art attention kernels. In this paper, we propose vAttention, a new approach for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention stores the KV-cache in contiguous virtual memory and leverages OS support for on-demand allocation of physical memory. vAttention thus enables one to use state-of-the-art attention kernels out of the box by adding support for dynamic allocation of physical memory without having to rewrite their code. We implement vAttention in the vLLM serving stack to show that it also helps improve decode throughput by up to 1.99x over vLLM, and the end-to-end serving throughput by up to 1.22x and 1.29x, compared to using the state-of-the-art PagedAttention-based kernels of FlashAttention and FlashInfer.
Machine Learning, Operating Systems
What problem does this paper attempt to address?
This paper addresses the problem of managing GPU memory efficiently to improve throughput during large language model (LLM) inference. Specifically, it focuses on dynamically managing the GPU memory that stores the key-value cache (KV-cache) used by the attention mechanism.

### Background and Problem

1. **Issues with Static Memory Allocation**:
   - Traditional systems (such as Orca and FasterTransformer) pre-allocate GPU memory based on the maximum context length supported by the model. However, the number of tokens actually generated is usually much smaller than the maximum context length, which leads to severe internal fragmentation and limits batch size and serving throughput.
2. **Limitations of PagedAttention**:
   - PagedAttention alleviates fragmentation by allocating memory dynamically, but it changes the layout of the KV-cache in virtual memory from contiguous to non-contiguous. This requires rewriting the attention kernels and implementing a memory manager in the serving framework, which adds performance overhead and programming complexity.
   - Experimental results show that PagedAttention suffers from performance degradation in multiple systems (such as vLLM, FlashAttention, FlashInfer, and TensorRT-LLM).

### Solution

The paper proposes a new method, vAttention, which keeps the KV-cache contiguous in virtual memory while still allocating physical memory dynamically. The main features of vAttention include:

1. **Leveraging Operating System Support**:
   - vAttention relies on the operating system's virtual memory and on-demand paging capabilities instead of implementing paging in user space. This avoids rewriting the attention kernels and allows state-of-the-art attention kernels to be used directly.
2. **Efficient Memory Management**:
   - vAttention reserves a large contiguous buffer in virtual memory ahead of time but allocates physical memory only at runtime, as it is needed. This retains the contiguity of virtual memory while avoiding wasted physical memory.
   - By modifying the open-source CUDA unified virtual memory driver, it supports finer-grained physical memory allocation (64KB pages), reducing fragmentation and waste.
3. **Optimization Measures**:
   - Overlapping memory allocation with computation, allocating pages ahead of use, and deferring memory reclamation hide the latency of memory allocation and improve the efficiency of vAttention.

### Experimental Results

The paper validates vAttention experimentally. vAttention improves decode throughput by up to 1.99x over vLLM, and end-to-end serving throughput by up to 1.22x and 1.29x over the PagedAttention-based kernels of FlashAttention and FlashInfer, respectively.

### Conclusion

vAttention provides a simpler and more efficient method for dynamic KV-cache memory management by leveraging the virtual memory and on-demand paging capabilities supported by the operating system, significantly improving LLM inference performance.
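
The idea of reserving virtual memory up front and backing it with physical pages only on demand can be illustrated with a small simulation. The sketch below is a toy model in plain Python, not the paper's implementation: it does not touch the GPU or any CUDA API, and the class name `VirtualKVBuffer`, the helper `pages_needed`, and the per-token KV size of 4096 bytes are all hypothetical values chosen for illustration. Only the 64KB page granularity comes from the summary above.

```python
from dataclasses import dataclass

PAGE_SIZE = 64 * 1024  # 64KB page granularity, as mentioned in the summary


def pages_needed(num_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a byte count up to a whole number of pages."""
    return -(-num_bytes // page_size)


@dataclass
class VirtualKVBuffer:
    """Toy model of one request's KV-cache (hypothetical, for illustration).

    The virtual range is sized for the maximum context length, but physical
    pages are only counted as mapped once tokens actually need them.
    """
    max_tokens: int        # maximum context length covered by the reservation
    bytes_per_token: int   # assumed per-token KV-cache footprint
    page_size: int = PAGE_SIZE
    tokens: int = 0
    mapped_pages: int = 0

    @property
    def reserved_bytes(self) -> int:
        """Virtual bytes reserved up front (what static allocation would commit)."""
        total = self.max_tokens * self.bytes_per_token
        return pages_needed(total, self.page_size) * self.page_size

    @property
    def committed_bytes(self) -> int:
        """Physical bytes actually backing the buffer right now."""
        return self.mapped_pages * self.page_size

    def append_tokens(self, n: int) -> int:
        """Grow the cache by n tokens and return how many new pages were mapped."""
        if self.tokens + n > self.max_tokens:
            raise ValueError("request exceeds the reserved virtual range")
        self.tokens += n
        required = pages_needed(self.tokens * self.bytes_per_token, self.page_size)
        newly_mapped = required - self.mapped_pages
        self.mapped_pages = required
        return newly_mapped


# Usage: a request that reserves room for 32768 tokens but only produces 1000.
buf = VirtualKVBuffer(max_tokens=32768, bytes_per_token=4096)
buf.append_tokens(1000)
saved = 1 - buf.committed_bytes / buf.reserved_bytes
print(f"reserved={buf.reserved_bytes} committed={buf.committed_bytes} saved={saved:.1%}")
```

Because the virtual range never moves, an attention kernel written for a contiguous KV-cache could index it unchanged; only the bookkeeping of which pages are backed differs from static allocation. Under these assumed sizes, most decode steps map no new page at all, which is the kind of slack that optimizations such as allocating ahead of use could exploit.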