KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava
2024-05-07
Abstract: Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
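
The entropy observation in the abstract can be illustrated with a small, self-contained sketch. It does not use real key/value activations: two synthetic correlated channels stand in for them, and the helper names `entropy_bits` and `quantize`, the 16-level binning, and the noise level are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def quantize(x, lo=-3.0, hi=3.0, levels=16):
    """Clip x to [lo, hi] and map it to one of `levels` uniform integer bins."""
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)

random.seed(0)
# Synthetic stand-in for two inter-dependent channels:
# channel b is channel a plus a small amount of independent noise.
a = [random.gauss(0.0, 1.0) for _ in range(20000)]
b = [x + random.gauss(0.0, 0.2) for x in a]

qa = [quantize(x) for x in a]
qb = [quantize(x) for x in b]

h_a = entropy_bits(qa)                    # marginal entropy of channel a
h_b = entropy_bits(qb)                    # marginal entropy of channel b
h_joint = entropy_bits(list(zip(qa, qb)))  # joint entropy of the pair

print(f"H(a) + H(b) = {h_a + h_b:.2f} bits")
print(f"H(a, b)     = {h_joint:.2f} bits")
```

When the channels are dependent, the joint entropy printed on the second line is smaller than the sum of the marginals on the first, which is the gap that coupling channels aims to exploit.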
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem that the key-value (KV) cache consumes a large share of GPU memory and limits inference speed for large language models (LLMs). As the batch size, context length, or model scale increases, the KV cache quickly becomes a major contributor to GPU memory usage and a bottleneck for inference latency. Existing quantization methods cannot compress the KV cache effectively at extremely low bit widths: model quality declines sharply. The authors therefore propose Coupled Quantization (CQ), which exploits the dependencies between channels to encode activations more efficiently and so maintain model quality at extremely low bit widths.

### Main Contributions

1. **Observed High Dependency**: The authors find that different channels of the same key/value activation embedding are highly inter-dependent, an insight that existing KV cache compression methods have not fully exploited.
2. **Proposed Coupled Quantization (CQ)**: CQ encodes information more efficiently by jointly quantizing multiple channels, exploiting the fact that their joint entropy is lower than the sum of their marginal entropies.
3. **Experimental Validation**: Through extensive experiments, the authors show that CQ maintains model quality at extreme compression levels (down to 1 bit per channel) and that it outperforms or is comparable to existing methods in most cases.

### Method Overview

- **Information-Theoretic Motivation**: The authors compare the joint entropy of multiple channels with the sum of their marginal entropies and show that jointly quantizing channels reduces the number of bits required.
- **Coupled Quantization (CQ)**: CQ divides the channels of key/value activation embeddings into equally sized, non-overlapping groups. Each group of channels is jointly quantized and shares a single quantization code. By learning multi-channel centroids, CQ can maintain model quality at low bit widths.
- **Centroid Learning**: CQ learns the multi-channel centroids with either uniform clustering or clustering guided by second-order information; the latter better preserves important activations.

### Experimental Results

- **Perplexity and Quantization Error**: As the number of jointly quantized channels increases, perplexity and quantization error decrease substantially, approaching the FP16 baseline.
- **Benchmarking**: CQ performs well across multiple datasets and benchmarks, especially at low bit widths, outperforming or matching existing dense and sparse quantization methods.

### Conclusion

CQ achieves efficient KV cache compression by exploiting the dependencies between channels of key/value activation embeddings, significantly reducing GPU memory usage and inference latency while maintaining model quality. The method offers a new approach to the efficient deployment of large language models.
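
The channel-grouping idea summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `fit_cq`, `encode`, `decode`, `mse`) are hypothetical, plain Lloyd's k-means stands in for the centroid learning step (no second-order weighting), and the data are synthetic correlated channels rather than real key/value activations. Both configurations below spend 1 bit per channel: two coupled channels share a 2-bit code, versus one 1-bit code per channel.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=15, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            buckets[j].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def fit_cq(calib, group_size, bits):
    """Learn one codebook of 2**bits centroids per group of `group_size` channels."""
    dim = len(calib[0])
    assert dim % group_size == 0
    codebooks = []
    for start in range(0, dim, group_size):
        pts = [tuple(x[start:start + group_size]) for x in calib]
        codebooks.append(kmeans(pts, 2 ** bits))
    return codebooks

def encode(x, codebooks, group_size):
    """Replace each channel group with the index of its nearest centroid."""
    codes = []
    for g, cb in enumerate(codebooks):
        sub = x[g * group_size:(g + 1) * group_size]
        codes.append(min(range(len(cb)), key=lambda i: sq_dist(sub, cb[i])))
    return codes

def decode(codes, codebooks):
    """Reconstruct the vector by looking up each group's centroid."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

def mse(data, codebooks, group_size):
    """Mean squared reconstruction error per channel."""
    total = sum(sq_dist(x, decode(encode(x, codebooks, group_size), codebooks))
                for x in data)
    return total / (len(data) * len(data[0]))

# Synthetic 4-channel data: channels (0, 1) and (2, 3) are strongly dependent.
rng = random.Random(0)
def sample():
    a, c = rng.gauss(0, 1), rng.gauss(0, 1)
    return [a, a + rng.gauss(0, 0.2), c, -c + rng.gauss(0, 0.2)]
data = [sample() for _ in range(2000)]

per_channel = fit_cq(data, group_size=1, bits=1)  # 1 bit per channel, no coupling
coupled = fit_cq(data, group_size=2, bits=2)      # 2 channels share a 2-bit code

mse_per_channel = mse(data, per_channel, 1)
mse_coupled = mse(data, coupled, 2)
print(f"per-channel 1-bit MSE: {mse_per_channel:.3f}")
print(f"coupled 2c/2b MSE:     {mse_coupled:.3f}")
```

On this toy data the coupled codebook reconstructs the vectors with lower error at the same bit budget, because its centroids can follow the dependency between channels instead of being restricted to a per-channel grid. The error is measured on the same samples used to fit the codebooks, so it illustrates the mechanism only and says nothing about model quality.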