Abstract: Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, which reduces memory usage and increases processing speed by expressing LLM weights in 3 or 2 bits. Practice has shown that directly minimizing the quantization error of the weights is ineffective and leads to overfitting. GPTQT therefore employs a progressive two-step approach: it first quantizes the weights to a relatively high bit width using linear quantization, then converts the resulting integer weights to a lower-bit binary coding. A re-explore strategy is proposed to optimize the initial scaling factor. During inference, these steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT's effectiveness. Compared to a strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on OPT-66B and increases speed by 1.24 times on OPT-30B. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for this kind of LLM.

What problem does this paper attempt to address?

This paper addresses the high computational and storage resource demands of large language models (LLMs) by proposing a novel post-training quantization method, GPTQT (Generative Pre-trained Transformer Quantize Twice). Its goal is to reduce memory usage and improve processing speed by representing the weights of LLMs in 3-bit or 2-bit binary codes.

### Main Contributions

1. **Proposing GPTQT**: A novel post-training quantization method that employs a heterogeneous, progressive two-stage quantization process to convert LLM weights into low-bit binary codes.
2. **Re-exploring Scaling Factors**: A new strategy is proposed to address the changes in representation range during the quantization process to optimize accuracy.
3. **Efficient Inference Process**: Demonstrates how weights processed by GPTQT can eliminate intermediate states during inference, enabling efficient binary-coded weight computation methods and significantly improving processing speed.

### Methodology

- **Two-step Quantization Process**: Initially, linear quantization is used to quantize weights to a relatively higher bit count (e.g., 5 bits), and then these integer weights are further converted into lower-bit binary codes (e.g., 3 bits). This phased approach helps maintain and recover accuracy.
- **Re-exploring Scaling Factors**: Since the second stage of the quantization process introduces changes in the representation range, it is necessary to re-evaluate and adjust the scaling factors to accommodate these changes and optimize accuracy.
- **Merging Intermediate Steps**: During inference, these two steps can be merged into pure binary coding, allowing the use of more efficient computation methods such as LUT-GEMM.

### Experimental Results

- Tests on multiple models and datasets confirm the effectiveness of GPTQT. Compared to a strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on OPT-66B and improves speed by 1.24 times on OPT-30B.
- Results on the Llama2 model indicate that GPTQT is currently the best binary-coded quantization method for such LLMs.

In summary, GPTQT aims to improve the efficiency of large language models through innovative quantization techniques, reducing storage requirements and enhancing computational performance.
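
The two-step idea summarized above (linear quantization to a higher bit width, conversion of the integer weights to a lower-bit binary coding, then merging both steps for inference) can be illustrated with a small NumPy sketch. This is not the paper's algorithm: the per-row min/max scaling, the greedy sign-based binary coding, and every function name here (`linear_quantize`, `greedy_binary_coding`, `quantize_twice`, `dequantize`) are assumptions chosen for illustration, and the re-explore step for the scaling factor is not reproduced.

```python
import numpy as np

def linear_quantize(w, bits):
    """Step 1 (assumed form): asymmetric uniform quantization of a 1-D row.

    Returns integer-valued q plus (scale, w_min) so that w ~= scale * q + w_min.
    """
    qmax = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q, scale, w_min

def greedy_binary_coding(q, k):
    """Step 2 (assumed form): q ~= offset + sum_i alpha_i * b_i, b_i in {-1, +1}.

    Each pass takes the sign of the residual as the code and the mean absolute
    residual as its coefficient, which is the least-squares choice for that code.
    """
    offset = float(q.mean())
    residual = q - offset
    alphas, codes = [], []
    for _ in range(k):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = float(np.abs(residual).mean())
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b
    return np.array(alphas), np.stack(codes), offset

def quantize_twice(w, high_bits=5, low_bits=3):
    """Run both steps, then fold the linear scale into the binary coefficients."""
    q, scale, w_min = linear_quantize(w, high_bits)
    alphas, codes, offset = greedy_binary_coding(q, low_bits)
    # Merge: w ~= scale * (offset + sum_i alpha_i b_i) + w_min
    #          = sum_i (scale * alpha_i) b_i + (scale * offset + w_min)
    return scale * alphas, codes, scale * offset + w_min

def dequantize(alphas, codes, bias):
    """Reconstruct weights from the merged, pure binary-coded form."""
    return alphas @ codes + bias

# Small demonstration on random weights (hypothetical data).
rng = np.random.default_rng(0)
row = rng.normal(size=16)
merged_alphas, sign_codes, merged_bias = quantize_twice(row)
row_hat = dequantize(merged_alphas, sign_codes, merged_bias)
```

After merging, only the sign codes, `low_bits` coefficients, and one bias per row are needed at inference time; the intermediate 5-bit integers never appear, which is the property that lets binary-coding kernels such as LUT-GEMM operate on the result.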

GPTQT: Quantize Large Language Models Twice to Push the Efficiency

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent

CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization

RPTQ: Reorder-based Post-training Quantization for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Evaluating Quantized Large Language Models

GPTVQ: The Blessing of Dimensionality for LLM Quantization

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

QQQ: Quality Quattuor-Bit Quantization for Large Language Models