Abstract: Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.

What problem does this paper attempt to address?

The paper addresses limitations of existing quantization techniques for Transformer models, in particular their insufficient attention to hardware performance. Specifically:

1. **Existing quantization methods do not fully account for hardware characteristics**: Most methods focus on algorithm-level optimization and ignore hardware-level constraints such as memory bandwidth and compute-intensive operations. As a result, a quantized model may see only limited speedup on specific hardware.
2. **Memory-bound operators are left unoptimized in dynamic quantization**: ZeroQuant, for example, proposes dynamic per-token activation quantization and per-column weight quantization, but does not cover memory-bound operations such as LayerNorm and attention. These usually remain in FP16 or BF16, which limits overall performance.
3. **Per-token quantization adds overhead**: When there is no opportunity to fuse it with another operation, per-token quantization requires launching additional kernels, which increases the computational cost.

To address these problems, the paper proposes ZeroQuant-HERO, a hardware-enhanced, robust, optimized post-training W8A8 quantization framework. Its main contributions are:

1. **Joint treatment of memory-bandwidth-bound and compute-intensive operations**: By optimizing both kinds of operators, the framework aims for optimal hardware performance.
2. **Flexible quantization levels**: Users can choose the ratio of INT8 to FP16/BF16 operations to balance accuracy against latency.
3. **Multiple quantization schemes**:
   - **Token-wise quantization (TWQ)**: Applied to inputs and outputs; it can be fused with layer normalization to reduce overhead.
   - **Feature-wise quantization (FWQ)**: Applied to certain operations in the attention module; it can be fused with memory-bound or compute-bound operations.
   - **Static quantization (SQ)**: Applied to intermediate tensors to make matrix multiplications more efficient.

Through these improvements, ZeroQuant-HERO aims to provide a more efficient and flexible quantization framework for Transformer models, one that improves inference speed and reduces memory usage while maintaining high accuracy.
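The three quantization schemes named above differ mainly in how the scale factor is chosen. The following NumPy sketch illustrates that difference and how per-token activations and per-column weights combine in a W8A8 matrix multiplication. It is an illustration under stated assumptions, not the paper's implementation: symmetric INT8 quantization with an absolute-maximum scale is assumed, the function names (`twq`, `fwq`, `sq`, `int8_matmul`) are hypothetical, and the feature-wise and static scales are derived directly from the example data as a stand-in for whatever calibration the framework actually uses. No kernel fusion is modeled.

```python
import numpy as np

QMAX = 127  # symmetric INT8 range [-127, 127] (assumption for this sketch)


def _quantize(x, scale):
    """Round x / scale to the nearest integer and clamp to the INT8 range."""
    return np.clip(np.rint(x / scale), -QMAX, QMAX).astype(np.int8)


def _absmax_scale(x, axis):
    """Scale that maps the largest magnitude along `axis` to QMAX."""
    absmax = np.abs(x).max(axis=axis, keepdims=True)
    return np.where(absmax == 0, 1.0, absmax / QMAX)


def twq(x):
    """Token-wise: one scale per row (token) of a (tokens, features) array."""
    scale = _absmax_scale(x, axis=1)
    return _quantize(x, scale), scale


def fwq(x):
    """Feature-wise: one scale per column (feature) of a 2-D array."""
    scale = _absmax_scale(x, axis=0)
    return _quantize(x, scale), scale


def sq(x, scale):
    """Static: a single scalar scale fixed ahead of time; outliers are clipped."""
    return _quantize(x, scale)


def int8_matmul(q_x, s_x, q_w, s_w):
    """W8A8 matmul: accumulate in INT32, then rescale with both sets of scales."""
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32)
    return acc * s_x * s_w  # (tokens, 1) and (1, out) scales broadcast


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # activations: 4 tokens, 16 features
w = rng.standard_normal((16, 8))   # weights: 16 inputs, 8 output columns

q_x, s_x = twq(x)                  # per-token activation scales
q_w, s_w = fwq(w)                  # per-column weight scales
y = int8_matmul(q_x, s_x, q_w, s_w)
max_err = np.abs(y - x @ w).max()  # deviation from the full-precision result

static_scale = float(np.abs(x).max() / QMAX)  # stands in for a calibrated scale
q_s = sq(x, static_scale)
```

Because the token-wise scale depends on the current input, it has to be computed at inference time, which is why fusing it into a neighboring operator such as layer normalization matters for cost. A static scale is a constant, so it adds no extra pass over the data, at the price of clipping values that exceed the calibrated range.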

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

PackQViT: Faster Sub-8-bit Vision Transformers Via Full and Packed Quantization on the Mobile.

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

FrameQuant: Flexible Low-Bit Quantization for Transformers

HAWQV3: Dyadic Neural Network Quantization

Automated Backend-Aware Post-Training Quantization

ZeroQ: A Novel Zero Shot Quantization Framework

EasyQuant: Post-training Quantization via Scale Optimization

Quantization without Tears

Towards Accurate and Efficient Sub-8-Bit Integer Training

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

EfQAT: An Efficient Framework for Quantization-Aware Training

RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization

RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization

Hardware-Centric AutoML for Mixed-Precision Quantization

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models