Efficient Post-training Quantization with FP8 Formats

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang
2024-04-01
Abstract: Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor:
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper examines how to quantize already-trained neural networks (post-training quantization) so that computational requirements drop while accuracy is maintained. It focuses on three FP8 representations, E5M2, E4M3, and E3M4, which trade dynamic range against precision to different degrees. The paper notes that although INT8 quantization is widely used, it struggles with workloads that have a large dynamic range, such as large language models, because its ability to represent outlier values is limited.

The paper proposes a general and scalable FP8 quantization workflow that applies across network architectures, and evaluates it on 75 networks covering tasks such as machine translation, language modeling, and image classification. The results show that FP8 formats outperform INT8 in workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Specifically, E4M3 is better suited to natural language processing (NLP) models, while E3M4 performs marginally better than E4M3 on computer vision tasks.

The paper also compares static and dynamic quantization and finds that dynamic quantization can improve accuracy for certain models with the E4M3 and E3M4 formats. It further indicates that FP8 handles outlier values more effectively than INT8 in workloads involving layer normalization. In summary, the paper addresses the efficiency and accuracy of quantizing deep learning models and proposes an FP8 quantization strategy that is better adapted to modern deep learning architectures.
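
The range-versus-precision trade-off described above can be made concrete with a small simulation. The sketch below is not the paper's workflow or the Intel Neural Compressor API. It is a minimal NumPy illustration under stated assumptions: an IEEE-style exponent bias of 2^(e-1) - 1, saturation at assumed maximum values of 57344 (E5M2), 448 (E4M3), and 30 (E3M4), and simple per-tensor absmax scaling. The helper names (`fp8_round`, `fp8_fake_quant`, `int8_fake_quant`) are hypothetical. Passing a precomputed `absmax` mimics static (calibrated) quantization; leaving it out computes the scale from the tensor at run time, which mimics dynamic quantization.

```python
import numpy as np

# name: (exponent bits, mantissa bits, ASSUMED largest finite value)
FORMATS = {
    "E5M2": (5, 2, 57344.0),
    "E4M3": (4, 3, 448.0),
    "E3M4": (3, 4, 30.0),
}

def fp8_round(x, exp_bits, man_bits, max_value):
    """Round x to the nearest value on a simulated FP8 grid, saturating at max_value."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1        # assumed IEEE-style exponent bias
    min_exp = 1 - bias                    # exponent of the smallest normal number
    clipped = np.clip(x, -max_value, max_value)
    mag = np.abs(clipped)
    _, e = np.frexp(mag)                  # mag = m * 2**e with m in [0.5, 1)
    exp = np.maximum(e - 1, min_exp)      # values below min_exp share the subnormal spacing
    step = np.ldexp(1.0, exp - man_bits)  # gap between neighbouring grid points
    return np.sign(clipped) * np.round(mag / step) * step

def fp8_fake_quant(x, fmt, absmax=None):
    """Scale into the FP8 range, round, and scale back (quantize-dequantize)."""
    exp_bits, man_bits, max_value = FORMATS[fmt]
    if absmax is None:                    # dynamic: scale taken from this tensor
        absmax = np.max(np.abs(x))
    scale = max_value / absmax
    return fp8_round(x * scale, exp_bits, man_bits, max_value) / scale

def int8_fake_quant(x, absmax=None):
    """Symmetric per-tensor INT8 quantize-dequantize, for comparison."""
    if absmax is None:
        absmax = np.max(np.abs(x))
    scale = 127.0 / absmax
    return np.clip(np.round(x * scale), -127, 127) / scale

# Synthetic activations: mostly small values plus one large outlier.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 4096)
x[0] = 100.0

print("INT8 mse:", np.mean((x - int8_fake_quant(x)) ** 2))
for name in FORMATS:
    print(name, "mse:", np.mean((x - fp8_fake_quant(x, name)) ** 2))
```

With a single large outlier, the uniform INT8 grid spacing is set by the outlier, so the many small values lose resolution. The FP8 grids are non-uniform and keep finer spacing near zero, which is the intuition behind the summary's remark about outlier handling. The synthetic data here is purely illustrative and says nothing about the accuracy figures reported in the paper.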