EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Ping Luo

2024-10-02

Abstract: Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical because it demands substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enlarging the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points of accuracy degradation compared to the full-precision model (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.
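The abstract refers to "quantization parameters (step sizes)" without spelling out the quantizer. As a minimal sketch, assuming a standard uniform (affine) quantizer with min-max initialization, the role of the step size can be illustrated as follows. The function names `init_quant_params` and `fake_quantize` are hypothetical and are not taken from the EfficientQAT code base; the paper's exact formulation (for example, its grouping of weights) may differ.

```python
def init_quant_params(weights, bits):
    """Min-max initialization of a step size and zero point (an assumed scheme)."""
    w_min, w_max = min(weights), max(weights)
    step_size = (w_max - w_min) / (2 ** bits - 1)
    zero_point = round(-w_min / step_size)
    return step_size, zero_point


def fake_quantize(weights, step_size, zero_point, bits):
    """Quantize to integers in [0, 2^bits - 1], then map back to floats."""
    q_max = 2 ** bits - 1
    restored = []
    for w in weights:
        q = round(w / step_size) + zero_point   # integer code before clamping
        q = max(0, min(q_max, q))               # clamp to the representable range
        restored.append((q - zero_point) * step_size)
    return restored


if __name__ == "__main__":
    weights = [-0.42, -0.13, 0.05, 0.27, 0.61]
    step_size, zero_point = init_quant_params(weights, bits=2)
    print(fake_quantize(weights, step_size, zero_point, bits=2))
```

In this picture, a phase that trains only the step sizes adjusts `step_size` while the integer codes stay fixed, which is far cheaper than updating every weight.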

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses the large memory footprint of large language models (LLMs). Quantization-aware training (QAT) reduces memory consumption through low-bit representations with minimal accuracy loss, but it requires substantial training resources, which makes it impractical for very large models. To tackle this, the paper proposes EfficientQAT, a more feasible QAT algorithm that consists of two consecutive stages:

1. **Block-wise training of all parameters (Block-AP)**: To the authors' knowledge, this is the first method to directly train all parameters in a block-wise manner. It reduces accuracy loss in low-bit quantization scenarios by enlarging the solution space available during optimization.
2. **End-to-end training of quantization parameters (E2E-QP)**: This stage trains only the quantization parameters (step sizes) end-to-end, further improving the quantized model by accounting for interactions among all sub-modules.

EfficientQAT outperforms previous quantization methods across a range of models (base LLMs, instruction-tuned LLMs, and multimodal LLMs), at scales from 7B to 70B parameters and at various quantization bit widths. For example, on a single A100-80GB GPU, EfficientQAT produces a 2-bit Llama-2-70B model in 41 hours, with an accuracy loss of less than 3 points compared to the full-precision model (69.48 vs. 72.41).
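The memory motivation and the reported accuracy gap can be checked with back-of-the-envelope arithmetic. The sketch below assumes a round 70 billion parameters and a 16-bit full-precision baseline, and it counts weight storage only: it ignores step sizes, zero points, activations, and any layers kept at higher precision, so the figures are rough estimates rather than numbers from the paper. The helper `weight_memory_gb` is a hypothetical name.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # assumed round figure for a 70B-parameter model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # assumed 16-bit baseline
int2_gb = weight_memory_gb(NUM_PARAMS, 2)   # 2-bit quantized weights

# Accuracy figures quoted in the text: 72.41 (full precision) vs. 69.48 (2-bit).
accuracy_gap = round(72.41 - 69.48, 2)

if __name__ == "__main__":
    print(f"16-bit weights: {fp16_gb:.1f} GB")
    print(f"2-bit weights:  {int2_gb:.1f} GB")
    print(f"accuracy gap:   {accuracy_gap} points")
```

Under these assumptions, the 2-bit weights alone take roughly one eighth of the 16-bit storage, which is consistent with the quantized model fitting on a single 80 GB GPU.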

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Low-Rank Quantization-Aware Training for LLMs

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Evaluating Quantized Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

CBQ: Cross-Block Quantization for Large Language Models

EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

RPTQ: Reorder-based Post-training Quantization for Large Language Models

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models