Abstract: Key-value (KV) caching has become the de facto approach to accelerating generation in large language model (LLM) inference. However, cache demand grows with sequence length, which has turned LLM inference into a memory-bound problem and significantly constrains system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By integrating the three techniques, GEAR fully exploits their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
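To make the memory pressure concrete, here is a back-of-the-envelope sketch of how KV cache size scales with sequence length. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not figures taken from the paper, and the low-bit estimate ignores quantization metadata as well as any low-rank or sparse overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # The factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical 7B-class configuration (assumed for illustration only).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, batch_size=1)

fp16_bytes = kv_cache_bytes(seq_len=4096, bytes_per_value=2, **config)
int4_bytes = kv_cache_bytes(seq_len=4096, bytes_per_value=0.5, **config)

print(f"FP16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {int4_bytes / 2**30:.2f} GiB (payload only, no metadata)")
```

The size is linear in both sequence length and batch size, which is why long contexts and large batches push inference toward being memory bound.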

What problem does this paper attempt to address?

The paper addresses a memory bottleneck in large language model (LLM) inference: key-value (KV) cache requirements grow significantly with sequence length, so memory becomes the limiting factor and severely restricts system throughput. Existing compression methods, such as discarding unimportant tokens or grouped quantization, can effectively reduce the cache size but often introduce high approximation errors. These errors accumulate during autoregressive decoding, leading to significant deviations in the generated output and degraded performance.

To address this challenge, the paper proposes GEAR, an efficient error-reduction framework that achieves near-lossless performance at a high compression ratio by augmenting a quantization scheme with two error-reduction components. Specifically, GEAR decomposes and compresses the KV cache matrix in three steps:

1. **Quantize the matrix**: First, apply an existing quantization method to quantize most entries, which have similar magnitudes, to ultra-low precision.
2. **Low-rank matrix**: Then introduce a low-rank matrix to efficiently approximate the quantization residual.
3. **Sparse matrix**: Finally, use a sparse matrix to correct the individual errors caused by outlier entries.

Through the synergy of these three components, GEAR reduces approximation error and maintains high performance at a high compression ratio. Experimental results show that at 2-bit compression, GEAR improves average accuracy by 14.95% compared to state-of-the-art baseline methods, while reducing peak memory usage by 2.39x and increasing throughput by 2.10x to 5.07x. In addition, GEAR introduces a stream-buffering strategy to further improve inference speed.
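The three-component decomposition described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `uniform_quantize`, `gear_like_decompose`, and `rel_error` are hypothetical, outliers are picked by absolute magnitude, quantization is a simple per-matrix min-max scheme, and the low-rank term comes from a truncated SVD rather than whatever solver the released code uses.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Min-max uniform quantization; returns the dequantized approximation."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo


def gear_like_decompose(x, bits=2, outlier_frac=0.02, rank=4):
    """Split x into quantized, low-rank, and sparse parts (illustrative only)."""
    # Sparse part: keep the largest-magnitude entries exactly.
    k = max(1, int(outlier_frac * x.size))
    threshold = np.partition(np.abs(x).ravel(), -k)[-k]
    sparse = np.where(np.abs(x) >= threshold, x, 0.0)

    # Quantized part: low-bit quantization of the remaining entries.
    quantized = uniform_quantize(x - sparse, bits)

    # Low-rank part: truncated SVD of the quantization residual.
    residual = x - sparse - quantized
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return quantized, low_rank, sparse


def rel_error(x, approx):
    """Relative Frobenius-norm reconstruction error."""
    return float(np.linalg.norm(x - approx) / np.linalg.norm(x))


# Synthetic stand-in for a KV cache slice: Gaussian entries plus a few outliers.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))
x[::8, 0] = 40.0

quantized, low_rank, sparse = gear_like_decompose(x)
err_plain = rel_error(x, uniform_quantize(x, 2))
err_quant_sparse = rel_error(x, quantized + sparse)
err_full = rel_error(x, quantized + low_rank + sparse)
print(f"plain 2-bit: {err_plain:.3f}")
print(f"quant + sparse: {err_quant_sparse:.3f}")
print(f"quant + low-rank + sparse: {err_full:.3f}")
```

On data like this, removing outliers first narrows the quantization range, and the truncated SVD cannot increase the Frobenius-norm error of the residual, so each added component lowers or preserves the reconstruction error. Real memory savings depend on how the low-bit codes, low-rank factors, and sparse entries are actually stored, which this sketch does not model.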

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Unifying KV Cache Compression for Large Language Models with LeanKV

Effectively Compress KV Heads for LLM

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Efficient LLM Inference with Kcache

Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

Lossless KV Cache Compression to 2%

Residual vector quantization for KV cache compression in large language model

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance