Abstract: Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outlier in LLMs. Such outliers are found to allocate most of the attention scores to the initial tokens of the input, termed pivot tokens, which are crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions. In addition, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement and achieves lossless weight-only INT4 quantization on various downstream tasks, leading to the new state-of-the-art for LLM quantization.

What problem does this paper attempt to address?

This paper addresses the performance degradation that large language models (LLMs) suffer under quantization. Specifically, it identifies a new type of outlier that concentrates on the initial tokens of the input sequence (referred to as "pivot tokens") and strongly affects the self-attention mechanism. Quantization error on these pivot tokens degrades the performance of quantized LLMs. To address this, the paper proposes IntactKV, which reduces quantization error by losslessly generating the key-value caches (KV caches) of pivot tokens from the full-precision model, thereby improving the performance of the quantized model.

### Background and Problem Description of the Paper

Large language models (LLMs) perform excellently on natural language processing tasks, but their computational resource requirements are huge. To reduce the computational cost, researchers have explored various compression and acceleration techniques, such as network quantization, pruning, and speculative decoding. Among them, network quantization converts model parameters or activations from floating-point numbers to fixed-point formats to reduce the model size and computational resource requirements. However, quantization inevitably affects the performance of LLMs, in particular because outliers in the model activations are sensitive to quantization.

### Newly Discovered Outlier Types

The paper identifies a new type of outlier that is particularly pronounced at the initial tokens of the input sequence (such as [BOS], commas, and periods). These outliers cause the self-attention mechanism to allocate most of its attention scores to these pivot tokens rather than to the other tokens, a phenomenon called "attention sinks". Attention sinks are crucial for model performance, so the effect of quantization on the pivot tokens needs special attention.

### Solution: IntactKV

To solve the above problems, the paper proposes a method named IntactKV. The core idea of IntactKV is to losslessly generate the KV caches of pivot tokens from the full-precision model and use them in combination with the quantized model. The specific steps are as follows:

1. **Generate KV Caches**: Use the full-precision model to generate the KV caches of pivot tokens and save them.
2. **Load KV Caches**: During inference, the quantized model loads these losslessly generated KV caches as a prefix, concatenates them with the KV caches of the remaining tokens, and continues autoregressive decoding.

### Experimental Results

The experimental results show that IntactKV significantly improves the performance of the quantized model, in the following respects:

- **Language Generation Tasks**: In perplexity (PPL) tests on the C4 and WikiText2 datasets, IntactKV significantly enhances the generation ability of the AWQ-quantized model, surpassing the previous state-of-the-art method OmniQuant.
- **Multi-task Understanding (MMLU)**: On the MMLU benchmark, IntactKV significantly improves the performance of the quantized Vicuna model in the zero-shot and five-shot settings.
- **Common-sense Question Answering Tasks**: On multiple zero-shot common-sense question answering tasks, IntactKV also significantly improves the performance of the quantized model.

### Theoretical Analysis

The paper also provides a theoretical analysis, proving that IntactKV effectively reduces the upper bound of the quantization error. By keeping the KV caches of pivot tokens unchanged, IntactKV reduces the propagation of quantization errors in the self-attention module.

### Conclusion

IntactKV is a simple and effective strategy that significantly improves the performance of quantized LLMs without adding inference overhead. By losslessly generating the KV caches of pivot tokens from the full-precision model, IntactKV effectively alleviates the impact of quantization errors on model performance.
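
The two steps described under the IntactKV solution (generate the pivot-token KV cache with full-precision weights, then load it as a prefix in the quantized model) can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a single attention layer with one head, uses plain round-to-nearest INT4 weight quantization as a stand-in for methods such as AWQ, and all function names (`quantize_rtn`, `build_kv_cache`, `attention`) are hypothetical.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-column round-to-nearest quantization (returned dequantized).
    Stand-in for a real weight-only quantizer; an assumption of this sketch."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=0, keepdims=True) / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def build_kv_cache(x, w_k, w_v, w_k_q, w_v_q, n_pivot):
    """Toy IntactKV-style cache for one layer.
    Step 1: keys/values of the first n_pivot tokens use full-precision weights.
    Step 2: the remaining tokens use quantized weights; both parts are concatenated."""
    k_pivot, v_pivot = x[:n_pivot] @ w_k, x[:n_pivot] @ w_v
    k_rest, v_rest = x[n_pivot:] @ w_k_q, x[n_pivot:] @ w_v_q
    return np.concatenate([k_pivot, k_rest]), np.concatenate([v_pivot, v_rest])

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

# Toy data: random hidden states and projection weights (hypothetical sizes).
rng = np.random.default_rng(0)
d, seq_len, n_pivot = 16, 6, 1
x = rng.normal(size=(seq_len, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_q_q, w_k_q, w_v_q = (quantize_rtn(w) for w in (w_q, w_k, w_v))

# Full-precision reference output for the last token's query.
ref = attention(x[-1:] @ w_q, x @ w_k, x @ w_v)

# Quantized model without and with the intact pivot-token prefix.
q_quant = x[-1:] @ w_q_q
k_plain, v_plain = build_kv_cache(x, w_k, w_v, w_k_q, w_v_q, n_pivot=0)
k_intact, v_intact = build_kv_cache(x, w_k, w_v, w_k_q, w_v_q, n_pivot=n_pivot)

err_plain = np.linalg.norm(attention(q_quant, k_plain, v_plain) - ref)
err_intact = np.linalg.norm(attention(q_quant, k_intact, v_intact) - ref)
print(f"output error, fully quantized KV: {err_plain:.4f}")
print(f"output error, intact pivot KV:    {err_intact:.4f}")
```

In a real transformer the cache would be kept per layer and per head, and the pivot-token prefix could be computed once by the full-precision model and reused across requests. The toy only shows how the two sources of keys and values are combined; random data will not necessarily reproduce the improvements reported in the paper.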

IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact