Abstract: Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.

What problem does this paper attempt to address?

This paper addresses the high resource demands of deploying large language models (LLMs) for inference, together with two weaknesses of existing quantization methods: significant performance degradation at low bit widths, and support for only a few specific bit levels. Specifically:

1. **High resource demand**: Large language models such as GPT and LLaMA require large amounts of compute and memory during inference. For example, loading the parameters alone takes 350GB of memory for GPT and at least 130GB for LLaMA-65B, far beyond the capacity of a single A100-80G GPU.
2. **Performance degradation during quantization**: Quantization (storing high-precision values in a lower-precision format) reduces a model's storage requirements and can increase inference speed, but existing methods often degrade performance significantly at lower bit widths. Many of them are also the result of extensive manual tuning.
3. **Lack of flexibility**: Existing methods usually handle only specific bit levels (such as INT4 or INT8) and cannot adapt to an arbitrarily specified bit width, including non-integer ones.

To solve these problems, the authors propose SKIM (Scaled K-means clustering wIth Mixed precision), which introduces two novel techniques:

1. **Greedy algorithm**: Finds an approximately optimal allocation of bits across weight channels, so that the available bit budget is distributed more sensibly.
2. **Trainable scaling vector**: Used with the non-differentiable K-means clustering operation to adjust for differences between columns, thereby improving the performance of the quantized model.

With these techniques, SKIM can adapt to any specified bit width (including non-integer ones). For 3-bit quantization of LLaMA models, it narrows the perplexity gap to the full-precision counterparts by 16.3% on average.
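The two ideas summarized above, K-means quantization and greedy bit allocation across channels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmeans_1d` and `greedy_bit_allocation`, the choice of treating each matrix row as a channel, the squared-error objective, and the "give one more bit to the channel with the largest error reduction" rule are all assumptions made for illustration, and the trainable scaling vector is omitted entirely.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on scalars; returns (centroids, squared error)."""
    # Assumption: initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    err = float(((values - centroids[assign]) ** 2).sum())
    return centroids, err

def greedy_bit_allocation(errors, avg_bits, min_bits=1):
    """Hypothetical greedy rule: errors[c, i] is the quantization error of
    channel c at (min_bits + i) bits. Every channel starts at min_bits, and
    each remaining bit goes to the channel whose error drops the most."""
    n_channels, n_levels = errors.shape
    bits = np.full(n_channels, min_bits)
    # Total bit budget implied by the (possibly non-integer) average.
    extra = int(round(avg_bits * n_channels)) - min_bits * n_channels
    for _ in range(extra):
        idx = bits - min_bits
        rows = np.flatnonzero(idx < n_levels - 1)  # channels that can still grow
        if rows.size == 0:
            break
        gain = errors[rows, idx[rows]] - errors[rows, idx[rows] + 1]
        bits[rows[gain.argmax()]] += 1
    return bits

# Toy weight matrix: 4 channels (rows) with very different scales.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 256)) * np.array([[0.1], [0.5], [1.0], [2.0]])

bit_options = [1, 2, 3, 4]
# Error table: one K-means run per channel and per candidate bit width.
errors = np.array(
    [[kmeans_1d(row, 2 ** b)[1] for b in bit_options] for row in W]
)

# A non-integer average of 2.5 bits over 4 channels is a budget of 10 bits.
bits = greedy_bit_allocation(errors, avg_bits=2.5, min_bits=1)
print("bits per channel:", bits, "average:", bits.mean())
```

In this toy setup, a non-integer target such as 2.5 bits is reached by mixing integer bit widths across channels, which is one way to read the "any-bit" claim in the summary above. The greedy rule here only approximates the best allocation for the given error table.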

SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

The case for 4-bit precision: k-bit Inference Scaling Laws

BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization