SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Runsheng Bai, Qiang Liu, Bo Liu
2024-12-05
Abstract: Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the high resource demands of deploying large language models (LLMs) for inference, in particular the significant performance degradation caused by low-bit quantization and the fact that existing methods offer solutions only at specific bit levels. Specifically:

1. **High resource demand**: Large language models such as GPT and LLaMA require substantial compute and memory during inference. For example, loading the parameters alone takes 350GB of memory for GPT and at least 130GB for LLaMA-65B, far beyond the capacity of an A100-80G GPU.
2. **Performance degradation during quantization**: Quantization (converting high-precision data into a low-precision format) reduces a model's storage requirements and can increase inference speed, but existing methods often suffer significant performance degradation at lower bit widths. They also typically provide only a limited set of solutions at specific bit levels, many of which are the result of extensive manual tuning.
3. **Lack of flexibility**: Existing methods usually support only specific bit levels (such as INT4 or INT8) and cannot adapt to an arbitrary target bit width, including non-integer ones.

To solve these problems, the authors propose SKIM (Scaled K-means clustering wIth Mixed precision), which improves quantization quality through two novel techniques:

1. **Greedy algorithm**: Finds an approximately optimal allocation of bits across weight channels, making more reasonable use of the bit budget.
2. **Trainable scaling vector**: Applied alongside the non-differentiable K-means clustering step to adjust for differences between columns, thereby improving the performance of the quantized model.
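To make the scaling-vector idea concrete, the following is a minimal sketch of K-means weight quantization with a per-column scaling vector. It is an illustration under assumptions, not the authors' implementation: the helper names `kmeans_1d`, `quantize_row`, and `mse` are hypothetical, the scaling vector is fixed by hand rather than trained, and plain Lloyd iterations with evenly spaced initial centroids stand in for whatever clustering procedure SKIM actually uses.

```python
def kmeans_1d(values, k, iters=25):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic init: k centroids evenly spaced over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every value.
        labels = [min(range(k), key=lambda c: (v - centroids[c]) ** 2)
                  for v in values]
        # Update step: each non-empty cluster's centroid becomes its mean.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


def quantize_row(row, scale, bits):
    """Quantize one weight row to 2**bits centroids after column scaling.

    `scale` is a hand-picked per-column vector here (an assumption);
    in SKIM the scaling vector is described as trainable.
    """
    scaled = [w / s for w, s in zip(row, scale)]
    centroids, labels = kmeans_1d(scaled, 2 ** bits)
    # Undo the scaling when reconstructing the dequantized weights.
    return [centroids[lab] * s for lab, s in zip(labels, scale)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy row whose 2nd and 4th columns are ten times larger than the others.
row = [1.0, 10.0, 2.0, 20.0]
unscaled = quantize_row(row, [1.0, 1.0, 1.0, 1.0], bits=1)
scaled = quantize_row(row, [1.0, 10.0, 1.0, 10.0], bits=1)
print("error without scaling:", mse(row, unscaled))
print("error with scaling:   ", mse(row, scaled))
```

In this toy row the large columns dominate the clustering when no scaling is applied, so the small weights end up sharing a centroid with a much larger one. Dividing each column by its scale before clustering removes that imbalance, which is the kind of column-difference adjustment the summary attributes to the scaling vector.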
Through these techniques, SKIM can adapt to any specified bit width, including non-integer ones. For 3-bit quantization of LLaMA models, it narrows the perplexity gap to the full-precision counterparts by 16.3% on average.
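Non-integer bit widths arise naturally from mixed precision: when channels receive different integer bit widths, the average can be any fraction. The sketch below shows one generic marginal-gain greedy allocation under a total bit budget. It is not taken from the paper; the function name `greedy_bit_allocation`, the per-channel error table, and the error values are all hypothetical.

```python
def greedy_bit_allocation(errors, total_bits, min_bits=1, max_bits=4):
    """Greedily assign integer bit widths to channels under a bit budget.

    errors[c][b] is the (hypothetical, precomputed) quantization error of
    channel c when stored with b bits. Every channel starts at min_bits;
    each remaining bit goes to the channel whose error drops the most.
    """
    n = len(errors)
    bits = [min_bits] * n
    budget = total_bits - min_bits * n  # extra bits left to hand out
    while budget > 0:
        best, best_gain = None, 0.0
        for c in range(n):
            if bits[c] < max_bits:
                gain = errors[c][bits[c]] - errors[c][bits[c] + 1]
                if best is None or gain > best_gain:
                    best, best_gain = c, gain
        if best is None:  # every channel is already at max_bits
            break
        bits[best] += 1
        budget -= 1
    return bits


# Four channels with made-up error tables; channels 0 and 2 are sensitive.
errors = [
    {1: 8.0, 2: 4.0, 3: 2.0, 4: 1.0},
    {1: 0.8, 2: 0.4, 3: 0.2, 4: 0.1},
    {1: 4.0, 2: 2.0, 3: 1.0, 4: 0.5},
    {1: 0.08, 2: 0.04, 3: 0.02, 4: 0.01},
]
allocation = greedy_bit_allocation(errors, total_bits=10)
average_bits = sum(allocation) / len(allocation)
print(allocation, average_bits)
```

With a budget of 10 bits over 4 channels the average is 2.5 bits per channel, a width that no single uniform integer format provides. The sensitive channels receive more bits than the insensitive ones.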