AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models

Baisong Li, Xingwang Wang, Haixiao Xu

2023-11-12

Abstract: Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come with significant computational and storage costs. Quantizing these models is an effective way to alleviate this issue. However, existing methods struggle to strike a balance between model accuracy and hardware efficiency. We introduce AWEQ, a post-training method that requires no additional training overhead. AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. It builds on the observation that weight quantization is less challenging than activation quantization. AWEQ transfers the difficulty of activation quantization to weights using channel equalization, balancing the quantization difficulty of the two and thereby maximizing performance. We further refine the equalization method to mitigate quantization bias error, ensuring the robustness of the model. Extensive experiments on popular models such as LLaMA and OPT demonstrate that AWEQ outperforms all existing post-training quantization methods for large models.
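As a rough illustration of the channel-equalization idea in the abstract, the sketch below rescales each input channel of a linear layer so that activations and weights end up with the same per-channel range. The scale `s_j = sqrt(max|x_j| / max|w_j|)` is an assumption chosen for illustration, not necessarily the formula AWEQ uses, and the helper names (`matmul`, `channel_abs_max`, `equalize`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def channel_abs_max(rows):
    """Largest absolute value in each row."""
    return [max(abs(v) for v in row) for row in rows]

def equalize(x, w, eps=1e-8):
    """Divide input channel j of x by s_j and multiply row j of w by s_j.

    x: activations, shape [tokens, c_in]; w: weights, shape [c_in, c_out].
    Assumed scale (illustrative only): s_j = sqrt(max|x[:, j]| / max|w[j, :]|).
    """
    act_max = channel_abs_max(list(zip(*x)))  # one value per input channel
    w_max = channel_abs_max(w)                # one value per weight row
    scales = [math.sqrt(max(a, eps) / max(b, eps)) for a, b in zip(act_max, w_max)]
    x_eq = [[v / s for v, s in zip(row, scales)] for row in x]
    w_eq = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_eq, w_eq, scales

# Toy data: channel 2 of the activations carries large outliers.
random.seed(0)
x = [[random.gauss(0, 1) * (50.0 if j == 2 else 1.0) for j in range(4)]
     for _ in range(16)]
w = [[random.gauss(0, 0.1) for _ in range(3)] for _ in range(4)]

x_eq, w_eq, scales = equalize(x, w)
print("activation range before:", channel_abs_max(list(zip(*x))))
print("activation range after: ", channel_abs_max(list(zip(*x_eq))))
print("weight range after:     ", channel_abs_max(w_eq))
```

Because every `s_j` cancels inside the matrix product, `matmul(x_eq, w_eq)` matches `matmul(x, w)` up to floating-point error, so the rescaling changes only how the dynamic range is split between activations and weights, not the layer's output.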

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem of efficient post-training quantization (PTQ) for large language models (LLMs): reducing their computational and storage overhead while maintaining accuracy. Existing quantization methods struggle to balance model accuracy and hardware efficiency, especially when quantizing activations. To this end, the paper proposes a new post-training quantization method called AWEQ (Activation-Weight Equalization Quantization), which shifts quantization difficulty from activations to weights and introduces dynamic statistical bias correction (BC) to improve the performance and robustness of the quantized model. The main contributions of AWEQ are:

1. **Ultra-low-bit and 8-bit quantization**: AWEQ performs well in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization.
2. **No additional training overhead**: AWEQ is a post-training method and requires no additional training process.
3. **Reduced quantization error**: Through channel equalization, AWEQ reduces the quantization grid points wasted on outliers, so that more of the original model's information is retained.
4. **Dynamic statistical bias correction**: The BC method corrects the bias errors introduced during quantization, ensuring the robustness of the model.

The paper validates AWEQ through extensive experiments on the widely used LLaMA and OPT models, showing that it outperforms existing post-training quantization methods on multiple tasks.
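The bias-correction idea can be sketched in its generic, expectation-based form: quantizing the weights of a linear layer shifts the expected output by `E[x] (W_q - W)`, and subtracting that shift from the layer bias restores the mean. This is a minimal sketch under stated assumptions, not the paper's dynamic statistical variant; the 4-bit round-to-nearest quantizer and all helper names are hypothetical stand-ins.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def column_means(m):
    """Mean of each column of a matrix stored as a list of rows."""
    return [sum(col) / len(col) for col in zip(*m)]

def quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(v) for row in w for v in row) / qmax
    return [[round(v / step) * step for v in row] for row in w]

def corrected_bias(x_calib, w, w_q, bias):
    """Return b' = b - E[x] (W_q - W), with E[x] estimated on calibration data."""
    mean_x = [column_means(x_calib)]  # shape [1, c_in]
    w_err = [[q - v for q, v in zip(rq, r)] for rq, r in zip(w_q, w)]
    shift = matmul(mean_x, w_err)[0]  # expected output shift per output channel
    return [b - s for b, s in zip(bias, shift)]

# Toy calibration data with a non-zero mean, so the weight error biases the output.
random.seed(0)
x = [[random.gauss(1.0, 1.0) for _ in range(4)] for _ in range(32)]
w = [[random.gauss(0, 0.1) for _ in range(3)] for _ in range(4)]
bias = [0.0, 0.0, 0.0]

w_q = quantize(w)
bias_corr = corrected_bias(x, w, w_q, bias)
print("corrected bias:", bias_corr)
```

On the calibration set, the quantized layer with `bias_corr` has the same per-channel mean output as the full-precision layer. Individual outputs still differ, since the correction removes only the systematic shift and not the per-sample quantization noise.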

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

LQER: Low-Rank Quantization Error Reconstruction for LLMs

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

RPTQ: Reorder-based Post-training Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Evaluating Quantized Large Language Models

AffineQuant: Affine Transformation Quantization for Large Language Models

GWQ: Gradient-Aware Weight Quantization for Large Language Models