Accumulator-Aware Post-Training Quantization

Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab
2024-09-26
Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.
Machine Learning, Artificial Intelligence, Discrete Mathematics
What problem does this paper attempt to address?
This paper addresses accumulator-aware quantization in the post-training quantization (PTQ) setting. Existing accumulator-aware methods have been developed only for quantization-aware training (QAT), which becomes increasingly expensive as model size grows. The central difficulty in reducing accumulator bit width is that the risk of numerical overflow rises sharply, and overflow introduces arithmetic errors that can severely degrade model accuracy. To address this, the authors propose AXE, a practical framework of accumulator-aware extensions that adds overflow-avoidance guarantees to existing layer-wise PTQ algorithms. The authors demonstrate AXE on two such algorithms, GPFQ and OPTQ, and generalize it to support multi-stage accumulation for the first time, which opens the door to full datapath optimization and scaling to large language models (LLMs). Evaluated on image classification and language generation models, AXE improves the trade-off between accumulator bit width and model accuracy over baseline methods. In summary, the paper aims to reduce the inference cost of deep learning models through low-precision accumulation while giving up as little accuracy as possible, which matters most when deploying large models in resource-constrained environments.
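The overflow risk described above can be illustrated with a small sketch. It is not AXE itself, only a worst-case check under stated assumptions: inputs are unsigned N-bit integers, weights are signed integers, and the accumulator is a signed two's-complement register that wraps on overflow. The function names `worst_case_range`, `min_accumulator_bits`, and `accumulate_wrapped` are hypothetical and chosen for illustration.

```python
def worst_case_range(weights, input_bits):
    """Return (lo, hi) bounds on any partial sum of w*x for unsigned inputs.

    Assumes each input x lies in [0, 2**input_bits - 1].
    """
    x_max = 2 ** input_bits - 1
    pos = sum(w for w in weights if w > 0)   # sum of positive weights
    neg = sum(w for w in weights if w < 0)   # sum of negative weights (<= 0)
    return neg * x_max, pos * x_max


def min_accumulator_bits(weights, input_bits):
    """Smallest signed accumulator width P that can never overflow."""
    lo, hi = worst_case_range(weights, input_bits)
    p = 1
    # A signed P-bit register holds values in [-2**(P-1), 2**(P-1) - 1].
    while not (-(2 ** (p - 1)) <= lo and hi <= 2 ** (p - 1) - 1):
        p += 1
    return p


def accumulate_wrapped(weights, inputs, acc_bits):
    """Dot product with two's-complement wraparound after every addition."""
    half = 2 ** (acc_bits - 1)
    acc = 0
    for w, x in zip(weights, inputs):
        acc = (acc + w * x + half) % (2 ** acc_bits) - half
    return acc


weights = [3, -2, 5, 1]   # illustrative integer weights
inputs = [15, 0, 15, 15]  # worst-case 4-bit unsigned inputs for these weights

print(min_accumulator_bits(weights, 4))         # → 9
print(accumulate_wrapped(weights, inputs, 9))   # → 135 (exact result)
print(accumulate_wrapped(weights, inputs, 8))   # → -121 (overflow wrapped)
```

With these weights, a 9-bit accumulator is always safe, while an 8-bit accumulator silently wraps the exact result 135 to -121. An accumulator-aware method constrains the quantized weights so that a bound of this kind holds for the target accumulator width.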