Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram

2024-11-26

Abstract: Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations, enabling GPUs to perform only the incremental computation required for each new token. This approach significantly lowers the computational overhead for token generation. However, the memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. In this paper, we introduce an efficient CPU-GPU I/O-aware LLM inference method that avoids transferring the entire KV cache from CPU to GPU by recomputing part of the KV cache from activations while concurrently transferring the remaining KV cache via the PCIe bus. This approach overlaps GPU recomputation with data transfer to minimize idle GPU time and maximize inference performance. Our method is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that our method achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
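
To make the memory growth mentioned in the abstract concrete, the following is a minimal sketch of the usual KV cache size estimate. It assumes standard multi-head attention, where each layer stores one key and one value vector of the hidden size per token. The function name `kv_cache_bytes` and the configuration values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(num_layers, batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes needed for keys plus values across all layers.

    Assumes standard multi-head attention: one key and one value vector
    of size `hidden` per token per layer (an assumption, not the paper's model).
    """
    return 2 * num_layers * batch * seq_len * hidden * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, hidden size 4096, fp16, batch 32.
    for seq_len in (512, 1024, 2048):
        gib = kv_cache_bytes(32, 32, seq_len, 4096) / 2**30
        print(f"seq_len={seq_len}: {gib:.0f} GiB")
```

Under these assumptions the cache grows linearly with both batch size and sequence length, which is why it can outgrow GPU memory for long contexts or large batches.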

Machine Learning, Performance

What problem does this paper attempt to address?

This paper addresses the GPU out-of-memory problem caused by the large memory footprint of Key-Value (KV) caches during large language model (LLM) inference. Offloading the KV cache to CPU memory relieves GPU memory pressure, but it introduces a new bottleneck: the limited bandwidth of the PCIe bus, which increases data-transfer latency and lowers GPU utilization and overall inference performance. To address this, the paper proposes an efficient CPU-GPU I/O-aware LLM inference method that combines partial KV cache recomputation with asynchronous data transfer. Specifically, instead of transferring the entire KV cache from the CPU to the GPU, the method transfers only the activations needed to regenerate part of the KV cache and recomputes that part on the GPU, while the remaining KV cache is transferred asynchronously over the PCIe bus. Overlapping recomputation with transfer reduces GPU idle time, and it avoids the excessive data movement and the dependence on CPU capabilities that limit existing methods. The paper also introduces an automated framework with three parts: a profiler module that collects input characteristics and system hardware information; a scheduler module that determines the optimal split of computation and communication workloads through linear programming; and a runtime module that efficiently executes the derived execution plan. Experimental results show up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
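
The trade-off between recomputing and transferring can be sketched with a toy cost model for a single layer. This is a simplified illustration, not the paper's linear program. It assumes that activations for the first `split` tokens are sent first, that recomputing keys and values from them costs two matrix multiplications, and that this recomputation fully overlaps with the transfer of the remaining KV cache. The class `LayerCost`, the function `best_split`, and all hardware numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    """Toy per-layer cost model (all fields are illustrative assumptions)."""
    batch: int
    seq_len: int
    hidden: int
    bytes_per_elem: int
    pcie_bytes_per_s: float
    gpu_flops_per_s: float

    def step_time(self, split: int) -> float:
        """Time to make the full KV cache available on the GPU when the
        first `split` tokens are recomputed and the rest are transferred."""
        tokens = self.batch * split
        rest = self.batch * (self.seq_len - split)
        # Activations are half the size of the KV entries they regenerate.
        act_bytes = tokens * self.hidden * self.bytes_per_elem
        kv_bytes = 2 * rest * self.hidden * self.bytes_per_elem
        # K = X @ W_K and V = X @ W_V: 2 * tokens * hidden^2 FLOPs each.
        flops = 4 * tokens * self.hidden ** 2
        t_act = act_bytes / self.pcie_bytes_per_s
        t_kv = kv_bytes / self.pcie_bytes_per_s
        t_rec = flops / self.gpu_flops_per_s
        # Recomputation runs while the remaining KV cache is in flight.
        return t_act + max(t_rec, t_kv)


def best_split(cost: LayerCost) -> int:
    """Brute-force search for the split point with the lowest step time."""
    return min(range(cost.seq_len + 1), key=cost.step_time)


if __name__ == "__main__":
    cost = LayerCost(batch=32, seq_len=1024, hidden=4096, bytes_per_elem=2,
                     pcie_bytes_per_s=32e9, gpu_flops_per_s=100e12)
    split = best_split(cost)
    print(split, cost.step_time(split), cost.step_time(0))
```

A `split` of 0 corresponds to transferring the whole KV cache. In this toy model the best split sits near the point where recomputation time and remaining transfer time are equal, which is the balance the scheduler described above is meant to find for the real system.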

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Efficient LLM Inference with Kcache

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Efficient LLM inference solution on Intel GPU

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Inference Performance Optimization for Large Language Models on CPUs

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

CORM: Cache Optimization with Recent Message for Large Language Model Inference

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Compute Or Load KV Cache? Why Not Both?

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation