Abstract: The KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing less important tokens. However, previous methods following this approach exhibit significant performance degradation at high compression ratios due to inaccuracies in identifying salient tokens. In this paper, we present ZipCache, an accurate and efficient KV cache quantization method for LLMs. First, we construct a strong baseline for quantizing the KV cache. Through the proposed channel-separable tokenwise quantization scheme, the memory overhead of quantization parameters is substantially reduced compared to fine-grained groupwise quantization. To enhance the compression ratio, we propose the normalized attention score as an effective metric for identifying salient tokens by considering the lower triangular characteristics of the attention matrix. Moreover, we develop an efficient approximation method that decouples the saliency metric from full attention scores, enabling compatibility with fast attention implementations like FlashAttention. Extensive experiments demonstrate that ZipCache achieves superior compression ratios, fast generation speed and minimal performance losses compared with previous KV cache compression methods. For instance, when evaluating the Mistral-7B model on the GSM8k dataset, ZipCache is capable of compressing the KV cache by $4.98\times$, with only a $0.38\%$ drop in accuracy. In terms of efficiency, ZipCache also showcases a $37.3\%$ reduction in prefill-phase latency, a $56.9\%$ reduction in decoding-phase latency, and a $19.8\%$ reduction in GPU memory usage when evaluating the LLaMA3-8B model with an input length of $4096$.
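
The lower triangular argument can be illustrated with a small sketch. Under causal masking, an early token is attended to by every later token, so summing attention weights per column favors early positions regardless of content. The sketch below assumes the normalized score divides each column sum by the number of rows that can attend to that column; the abstract does not spell out the formula, so this reading, the function names, and the toy input are all assumptions rather than the paper's implementation.

```python
import numpy as np

def causal_attention(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax restricted to the lower triangle (causal mask)."""
    l = scores.shape[0]
    mask = np.tril(np.ones((l, l), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row max for numerical stability; the diagonal is always
    # unmasked, so every row has a finite maximum.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

def accumulated_score(attn: np.ndarray) -> np.ndarray:
    """Column sums: total attention each token receives."""
    return attn.sum(axis=0)

def normalized_score(attn: np.ndarray) -> np.ndarray:
    """Column sums divided by the count of rows able to attend to the column.

    Assumption: column i (0-indexed) of an l x l causal matrix has l - i
    entries on or below the diagonal.
    """
    l = attn.shape[0]
    counts = l - np.arange(l)
    return attn.sum(axis=0) / counts

# Toy input (hypothetical): identical logits, so no token is intrinsically
# more important than another.
attn = causal_attention(np.zeros((4, 4)))
acc = accumulated_score(attn)
norm = normalized_score(attn)
```

With identical logits, the gap between the first and last token is smaller under the normalized score than under the accumulated score, which is the positional bias the normalization is meant to reduce.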

Residual vector quantization for KV cache compression in large language model

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Unifying KV Cache Compression for Large Language Models with LeanKV

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

QAQ: Quality Adaptive Quantization for LLM KV Cache

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries