Abstract: Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
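
The entropy observation in the abstract can be stated compactly. For any set of random variables, the joint entropy never exceeds the sum of the marginal entropies, with equality only when the variables are mutually independent. The symbols $X_i$ (the activation in channel $i$), $c$ (channels per coupled group), and $b$ (bits per group code) are notation assumed here for illustration and do not appear in the text above.

```latex
H(X_1, \dots, X_c) \;\le\; \sum_{i=1}^{c} H(X_i)
```

Under this notation, one $b$-bit code shared by $c$ coupled channels indexes $2^b$ centroids and costs $b/c$ bits per channel. For example, $c = 8$ and $b = 8$ gives 256 centroids at 1 bit per channel.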

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the large GPU memory footprint of the key-value (KV) cache and the latency it adds during large language model (LLM) inference. As the batch size, context length, or model scale increases, the KV cache quickly becomes a major contributor to GPU memory usage and a bottleneck for inference latency. Existing quantization methods cannot effectively compress the KV cache at extremely low bit widths, leading to a sharp decline in model quality. The authors therefore propose Coupled Quantization (CQ), which exploits the dependencies between channels to encode activations more efficiently and so maintain model quality at extremely low bit widths.

### Main Contributions

1. **Observed High Dependency**: The authors found a high dependency between different channels of the same key/value activation embedding, an insight that existing KV cache compression methods have not fully exploited.
2. **Proposed Coupled Quantization (CQ)**: CQ encodes information more efficiently by jointly quantizing multiple channels, taking advantage of their low joint entropy.
3. **Experimental Validation**: Through extensive experiments, the authors demonstrate that CQ maintains model quality at extreme compression levels (down to 1 bit per channel) and outperforms or is comparable to existing methods in most cases.

### Method Overview

- **Information-Theoretic Motivation**: Using concepts from information theory, the authors compare the joint entropy of multiple channels with the sum of their marginal entropies and show that jointly quantizing channels reduces the number of bits required.
- **Coupled Quantization (CQ)**: CQ divides the channels of key/value activation embeddings into equally sized, non-overlapping groups. Each group of channels is jointly quantized and shares a single quantization code. By learning multi-channel centroids, CQ can maintain model quality at low bit widths.
- **Centroid Learning**: CQ learns the multi-channel centroids with either uniform clustering or clustering guided by second-order information, the latter better preserving important activations.

### Experimental Results

- **Perplexity and Quantization Error**: As the number of jointly quantized channels increases, perplexity and quantization error both drop significantly, bringing model quality close to the FP16 baseline.
- **Benchmarking**: CQ performs well across multiple datasets and benchmarks, especially at low bit widths, outperforming or matching existing dense and sparse quantization methods.

### Conclusion

CQ compresses the KV cache efficiently by leveraging the dependencies between channels of key/value activation embeddings, significantly reducing GPU memory usage and inference latency while maintaining model quality. This method provides a new solution for the efficient deployment of large language models.
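
The grouping and centroid steps described in the method overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain, unweighted k-means in NumPy as a stand-in for the uniform clustering variant, omits the second-order weighting entirely, and runs on synthetic correlated data. The function names `learn_centroids`, `encode`, and `decode`, and all parameter values, are hypothetical.

```python
import numpy as np


def learn_centroids(groups, bits, iters=10, seed=0):
    """Unweighted k-means per channel group (assumed stand-in for uniform clustering).

    groups: array of shape (n_tokens, n_groups, c), c channels per group.
    Returns centroids of shape (n_groups, 2**bits, c).
    """
    rng = np.random.default_rng(seed)
    n_tokens, n_groups, c = groups.shape
    k = 2 ** bits
    centroids = np.empty((n_groups, k, c), dtype=groups.dtype)
    for g in range(n_groups):
        x = groups[:, g, :]
        # Initialise centroids from k distinct tokens.
        cent = x[rng.choice(n_tokens, size=k, replace=False)].copy()
        for _ in range(iters):
            dist = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for j in range(k):
                members = x[assign == j]
                if len(members):  # keep the old centroid if a cluster is empty
                    cent[j] = members.mean(0)
        centroids[g] = cent
    return centroids


def encode(groups, centroids):
    """Map each channel group to the index of its nearest centroid (bits <= 8)."""
    dist = ((groups[:, :, None, :] - centroids[None, :, :, :]) ** 2).sum(-1)
    return dist.argmin(-1).astype(np.uint8)


def decode(codes, centroids):
    """Look up the centroid for every (token, group) code."""
    group_idx = np.arange(centroids.shape[0])[None, :]
    return centroids[group_idx, codes]


# Synthetic, strongly correlated "key" activations: 512 tokens, 8 channels.
rng = np.random.default_rng(0)
base = rng.normal(size=(512, 1))
keys = (base + 0.1 * rng.normal(size=(512, 8))).astype(np.float32)

c, bits = 2, 2                       # 2 coupled channels, one 2-bit code: 1 bit per channel
groups = keys.reshape(512, 8 // c, c)
centroids = learn_centroids(groups, bits)
codes = encode(groups, centroids)    # shape (512, 4), one code per channel group
recon = decode(codes, centroids).reshape(keys.shape)
mse = float(((keys - recon) ** 2).mean())
print("bits per channel:", bits / c, "reconstruction MSE:", mse)
```

Raising `c` and `bits` together (for example `c = 4`, `bits = 4`) keeps the budget at 1 bit per channel while giving each group more centroids to capture the joint structure, which is the trade-off the summary describes.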

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Residual vector quantization for KV cache compression in large language model

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

QAQ: Quality Adaptive Quantization for LLM KV Cache

Unifying KV Cache Compression for Large Language Models with LeanKV

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization

Lossless KV Cache Compression to 2%

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing