GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
2024-10-01
Abstract: Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By integrating these three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses a memory bottleneck in generative inference with large language models (LLMs): the key-value (KV) cache grows with sequence length, so inference becomes memory-bound and system throughput is severely limited. Existing compression methods, such as discarding unimportant tokens or group-wise quantization, effectively reduce the cache size but often introduce high approximation errors. These errors accumulate during autoregressive decoding, causing significant deviations in the generated output and degrading performance. To address this challenge, the paper proposes GEAR, an efficient error-reduction framework that augments a quantization scheme with two error-reduction components to achieve near-lossless performance at high compression ratios. Specifically, GEAR decomposes and compresses the KV cache matrix in three steps:

1. **Quantized matrix**: First, apply an existing quantization method to compress most entries, which have similar magnitudes, to ultra-low precision.
2. **Low-rank matrix**: Then introduce a low-rank matrix to efficiently approximate the quantization residual.
3. **Sparse matrix**: Finally, use a sparse matrix to correct the individual errors caused by outlier entries.

Through the synergy of these three components, GEAR reduces approximation error and maintains high performance at high compression ratios. Experimental results show that at 2-bit compression, GEAR improves average accuracy by 14.95% over state-of-the-art baseline methods, while reducing peak memory usage by 2.39x and increasing throughput by 2.10x to 5.07x. GEAR also introduces a stream-buffering strategy to further improve inference speed.
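The three-component decomposition described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `uniform_quantize` and `gear_like_compress` are hypothetical, per-tensor min-max quantization and a truncated SVD are simple stand-ins for whatever quantizer and low-rank solver the paper actually uses, and pulling the outliers out before quantizing (so they do not stretch the quantization range) is an assumption of this sketch.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Per-tensor min-max uniform quantization, returned in dequantized form.

    A simple stand-in quantizer (an assumption of this sketch).
    """
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)  # integers in [0, levels]
    return codes * scale + lo


def gear_like_compress(x, bits=4, outlier_frac=0.02, rank=4):
    """Approximate x as quantized + sparse + (a @ b).

    Returns (quantized, sparse, a, b). All names are hypothetical.
    """
    # Sparse part: keep the largest-magnitude entries at full precision.
    k = max(1, int(outlier_frac * x.size))
    threshold = np.partition(np.abs(x).ravel(), -k)[-k]
    mask = np.abs(x) >= threshold
    sparse = np.where(mask, x, 0.0)

    # Quantized part: the remaining entries share a narrow value range.
    quantized = uniform_quantize(x - sparse, bits)

    # Low-rank part: truncated SVD of the remaining residual.
    residual = x - sparse - quantized
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (rows, rank)
    b = vt[:rank]               # shape (rank, cols)
    return quantized, sparse, a, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = rng.standard_normal(64 * 32)
    flat[rng.choice(flat.size, size=20, replace=False)] = 50.0  # synthetic outliers
    x = flat.reshape(64, 32)

    q, sp, a, b = gear_like_compress(x)
    print("quantization only:", np.linalg.norm(x - uniform_quantize(x, 4)))
    print("quantized + sparse:", np.linalg.norm(x - (q + sp)))
    print("all three parts:   ", np.linalg.norm(x - (q + sp + a @ b)))
```

On a matrix with a few large outliers, quantizing everything uniformly spends most of the available levels on the outlier range, which is why removing them into the sparse part first shrinks the error so sharply. The truncated SVD then can only reduce the Frobenius norm of what is left, never increase it.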