Efficient Post-training Quantization with FP8 Formats

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang

2024-04-01

Abstract: Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor:

Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses how to quantize already-trained neural networks (post-training quantization) so that computational requirements drop while accuracy is maintained. It studies three 8-bit floating-point (FP8) representations, E5M2, E4M3, and E3M4, which trade dynamic range against precision to different degrees. The paper points out that although INT8 quantization is widely used, it struggles on workloads with a large dynamic range, such as large language models, because it has limited capacity to represent outlier values.

The paper proposes a general FP8 quantization workflow that applies across network architectures and evaluates it on 75 networks covering tasks such as machine translation, language modeling, and image classification. The results show that FP8 formats outperform INT8 in workload coverage (92.64% vs. 65.87%), in model accuracy, and in suitability for a broader range of operations. Specifically, E4M3 is better suited to natural language processing (NLP) models, while E3M4 performs marginally better than E4M3 on computer vision tasks. The paper also compares static and dynamic quantization and finds that dynamic quantization can improve accuracy for certain models in the E4M3 and E3M4 formats. It further indicates that FP8 handles outlier values more effectively than INT8 in tasks involving layer normalization. In short, the paper targets the efficiency and accuracy problems of quantizing deep learning models and proposes an FP8 quantization strategy that better fits the requirements of modern architectures.
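The range-versus-precision trade-off between the three formats, and the outlier problem attributed to INT8, can be illustrated with a small simulation. The following is a minimal sketch, not the paper's workflow or Intel Neural Compressor code: the helper names (`fp_max`, `round_to_minifloat`, `fake_quant_fp8`, `fake_quant_int8`) are hypothetical, the encoding conventions (E5M2 reserving its top exponent code as in IEEE formats, E4M3 and E3M4 reserving only a single top bit pattern) are assumptions, and the per-tensor max scaling is a simplification chosen for illustration.

```python
import math
import random

# (exponent bits, mantissa bits, IEEE-like top exponent reserved?)
# The encoding conventions below are assumptions made for this sketch.
FORMATS = {
    "E5M2": (5, 2, True),
    "E4M3": (4, 3, False),
    "E3M4": (3, 4, False),
}

def fp_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of the simulated format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2.0 - 2.0 ** -man_bits
    else:
        # Only the all-ones mantissa at the top exponent is reserved.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2.0 - 2.0 ** (1 - man_bits)
    return max_frac * 2.0 ** max_exp

def round_to_minifloat(x, exp_bits, man_bits, max_val):
    """Round x to the nearest representable value, saturating at max_val."""
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias  # exponent of the smallest normal number
    x = max(-max_val, min(max_val, x))
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)      # subnormals share the smallest spacing
    step = 2.0 ** (exp - man_bits)
    return round(x / step) * step

def fake_quant_fp8(values, exp_bits, man_bits, ieee_like):
    """Quantize and dequantize with one per-tensor scale (assumed scheme)."""
    max_val = fp_max(exp_bits, man_bits, ieee_like)
    scale = max_val / max(abs(v) for v in values)
    return [round_to_minifloat(v * scale, exp_bits, man_bits, max_val) / scale
            for v in values]

def fake_quant_int8(values):
    """Symmetric per-tensor INT8 quantize and dequantize."""
    scale = 127.0 / max(abs(v) for v in values)
    return [max(-127, min(127, round(v * scale))) / scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic activations: mostly small values plus one large outlier.
rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [100.0]

errors = {"INT8": mse(acts, fake_quant_int8(acts))}
for name, (e_bits, m_bits, ieee) in FORMATS.items():
    errors[name] = mse(acts, fake_quant_fp8(acts, e_bits, m_bits, ieee))
    print(f"{name}: max = {fp_max(e_bits, m_bits, ieee)}")
for name, err in errors.items():
    print(f"{name}: MSE = {err:.6f}")
```

Under these assumptions, the single outlier stretches the uniform INT8 grid so that most small values land on only a few integer levels, whereas the floating-point grids keep finer spacing near zero. This is only a synthetic illustration of the mechanism; it says nothing about the accuracy figures reported in the paper.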

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Improving Neural Network Efficiency Via Post-training Quantization with Adaptive Floating-Point

Exploring the Potential of Flexible 8-Bit Format: Design and Algorithm

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

Towards Accurate and Efficient Sub-8-Bit Integer Training

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

EasyQuant: Post-training Quantization via Scale Optimization

Novel adaptive quantization methodology for 8-bit floating-point DNN training

FP8 versus INT8 for efficient deep learning inference

Optimization-based Post-training Quantization with Bit-split and Stitching

Loss Aware Post-training Quantization

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

Post Training Quantization of Large Language Models with Microscaling Formats

HAWQV3: Dyadic Neural Network Quantization

Automated Backend-Aware Post-Training Quantization