Accumulator-Aware Post-Training Quantization

Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab

2024-09-26

Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.

Machine Learning, Artificial Intelligence, Discrete Mathematics

What problem does this paper attempt to address?

This paper addresses accumulator-aware quantization in the post-training quantization (PTQ) setting. Existing accumulator-aware methods have only been studied under quantization-aware training (QAT), which becomes increasingly expensive as models grow. The main obstacle to reducing accumulator bit width is the increased risk of numerical overflow, which introduces arithmetic errors that can severely degrade model accuracy. To address this, the authors propose AXE, a practical framework of accumulator-aware extensions that adds overflow-avoidance guarantees to existing layer-wise PTQ algorithms. AXE is demonstrated on top of two PTQ algorithms, GPFQ and OPTQ, and is generalized to support multi-stage accumulation for the first time, which opens the door to full datapath optimization and scaling to large language models (LLMs). Evaluated on image classification and language generation models, AXE improves the trade-off between accumulator bit width and model accuracy over baseline methods. In summary, the paper aims to reduce the inference cost of deep learning models through accumulator-aware quantization while limiting the loss in accuracy, which matters most when deploying large models in resource-constrained environments.
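To make the overflow risk concrete, the following is a minimal Python sketch of a worst-case check for a single dot product. It is not the AXE algorithm and is not taken from the paper: it only uses the elementary bound that an integer dot product can never exceed the sum of absolute weight values times the largest activation magnitude. The function names `min_accumulator_bits` and `is_overflow_safe` are hypothetical, and the bound is conservative (it may ask for one more bit than strictly required).

```python
import numpy as np


def min_accumulator_bits(q_weights, act_bits, act_signed=False):
    """Conservative signed accumulator width that cannot overflow.

    Illustrative only: assumes integer weights and integer activations
    of `act_bits` bits, accumulated in a two's-complement register.
    """
    q = np.asarray(q_weights, dtype=np.int64)
    # Largest activation magnitude representable with act_bits bits.
    max_abs_act = 2 ** (act_bits - 1) if act_signed else 2 ** act_bits - 1
    # Worst case: every product has the same sign and maximal magnitude.
    worst_case = int(np.abs(q).sum()) * max_abs_act
    # A signed P-bit register holds values up to 2**(P-1) - 1, so we need
    # worst_case <= 2**(P-1) - 1, i.e. P >= bit_length(worst_case) + 1.
    return worst_case.bit_length() + 1


def is_overflow_safe(q_weights, act_bits, acc_bits, act_signed=False):
    """True if a signed acc_bits-wide accumulator is safe for any input."""
    return min_accumulator_bits(q_weights, act_bits, act_signed) <= acc_bits


weights = [3, -2, 1, 4]  # hypothetical integer weights for one output channel
print(min_accumulator_bits(weights, act_bits=8))           # 13
print(is_overflow_safe(weights, act_bits=8, acc_bits=16))  # True
print(is_overflow_safe(weights, act_bits=8, acc_bits=12))  # False
```

Read in reverse, the same inequality shows why a narrower accumulator constrains the weights: for a fixed accumulator width and activation width, the sum of absolute weight values must stay below a fixed budget, which is the kind of constraint an accumulator-aware quantizer has to respect.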

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

A2Q+: Improving Accumulator-Aware Weight Quantization

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

Attention-aware Post-training Quantization without Backpropagation

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

Post-training Quantization or Quantization-aware Training? That is the Question

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Evaluating Quantized Large Language Models

EfQAT: An Efficient Framework for Quantization-Aware Training

Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats