ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He
2023-10-27
Abstract: Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses limitations of existing quantization techniques for Transformer models, in particular their insufficient attention to hardware performance. Specifically:

1. **Existing quantization methods do not fully account for hardware characteristics**: Most methods focus on algorithm-level optimization and overlook hardware-level constraints such as memory bandwidth and compute-intensive operations. As a result, a quantized model may see only a limited performance gain on the target hardware.
2. **Memory-bound operators are left unoptimized under dynamic quantization**: ZeroQuant, for example, proposes dynamic per-token activation quantization and per-column weight quantization, but it does not cover memory-bound operators such as LayerNorm and Attention. These usually remain in FP16 or BF16, which limits overall performance.
3. **Per-token quantization adds overhead**: When there is no opportunity to fuse operations, per-token quantization requires launching additional kernels, which increases the computational cost.

To address these problems, the paper proposes ZeroQuant-HERO, a hardware-enhanced, robust, optimized post-training quantization framework for W8A8 Transformers. Its main contributions are:

1. **Joint treatment of memory-bandwidth-bound and compute-intensive operators**: The framework covers both kinds of operators, aiming for optimal hardware performance.
2. **Flexible quantization levels**: Users can choose the ratio of INT8 to FP16/BF16 operations to balance accuracy against latency.
3. **Multiple quantization schemes**:
   - **Token-wise quantization (TWQ)**: Applied to inputs and outputs; it can be fused with layer normalization to reduce overhead.
   - **Feature-wise quantization (FWQ)**: Applied to certain operations in the attention module; it can be fused with memory-bound or compute-bound operations.
   - **Static quantization (SQ)**: Applied to intermediate tensors to make matrix multiplication more efficient.

With these changes, ZeroQuant-HERO aims to provide a more efficient and flexible quantization framework for Transformer models, improving inference speed and reducing memory usage while maintaining high accuracy.
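
The three schemes named above differ mainly in how a scale is shared across a tensor of shape (tokens, features): one scale per token row, one per feature column, or a single scale fixed in advance. The following is a minimal sketch of that distinction in plain Python, not the paper's implementation. It assumes symmetric INT8 quantization with a max-absolute-value scale; the function names and the choice to derive the feature-wise scales from the input itself are illustrative assumptions, since the summary does not describe how scales are calibrated.

```python
INT8_MAX = 127  # symmetric INT8 range [-127, 127] (an assumption of this sketch)


def _scale(values):
    """Max-abs scale for a group of values; falls back to 1.0 for an all-zero group."""
    m = max(abs(v) for v in values)
    return m / INT8_MAX if m > 0 else 1.0


def _quantize(v, scale):
    """Round v / scale to the nearest integer and clamp to the INT8 range."""
    return max(-INT8_MAX, min(INT8_MAX, round(v / scale)))


def token_wise_quantize(x):
    """TWQ sketch: one scale per row (token)."""
    scales = [_scale(row) for row in x]
    q = [[_quantize(v, s) for v in row] for row, s in zip(x, scales)]
    return q, scales


def feature_wise_quantize(x):
    """FWQ sketch: one scale per column (feature)."""
    scales = [_scale(col) for col in zip(*x)]
    q = [[_quantize(v, s) for v, s in zip(row, scales)] for row in x]
    return q, scales


def static_quantize(x, scale):
    """SQ sketch: a single scale chosen ahead of time and passed in by the caller."""
    return [[_quantize(v, scale) for v in row] for row in x]


# Hypothetical 2-token, 2-feature activation tensor.
x = [[1.0, -2.0], [0.5, 4.0]]
q_twq, s_twq = token_wise_quantize(x)
q_fwq, s_fwq = feature_wise_quantize(x)
q_sq = static_quantize(x, scale=4.0 / INT8_MAX)
print("TWQ:", q_twq)
print("FWQ:", q_fwq)
print("SQ: ", q_sq)
```

In this sketch the token-wise scales depend on each input row, so they must be computed at run time, which is why fusing that step into a neighbouring operator such as layer normalization avoids a separate pass over the data. A static scale needs no run-time reduction at all.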