FP8 versus INT8 for efficient deep learning inference

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

2023-06-15

Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware-training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.

Machine Learning

What problem does this paper attempt to address?

The paper aims to address the potential benefits and drawbacks of using the FP8 format for deep learning inference, particularly in comparison to the widely used INT8 format. The primary focus is on determining whether FP8, with its floating-point representation, can offer significant advantages over INT8, especially in terms of efficiency and accuracy for inference tasks on edge devices.

### Key Points:

1. **Background and Context:**
   - Currently, most deep learning models are trained using FP32 or FP16 formats, and when deployed for inference on edge devices, they are often quantized to INT8 for efficiency.
   - There is growing interest in using FP8 for training, which could potentially simplify the deployment process by avoiding the quantization step.
2. **Hardware Efficiency:**
   - The paper demonstrates that FP8 is at least 50% less efficient in terms of area and energy usage compared to INT8 in dedicated hardware implementations. This means that FP8 would need to be significantly more accurate to justify its use over INT8.
   - Floating-point matrix multiplications are generally less efficient than integer ones, leading to higher hardware costs and potentially slower performance.
3. **Accuracy Comparison:**
   - Theoretical analysis shows that the main difference between FP8 and INT8 lies in their ability to handle outliers in the data. FP8 formats with more exponent bits (e.g., FP8-E4) can better represent outliers, while INT8 and FP8 formats with fewer exponent bits (e.g., FP8-E3) may struggle with this aspect.
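The outlier argument in the accuracy comparison can be illustrated with a small simulation. The following is a minimal sketch, not the paper's code: the helper names `fp_grid`, `quantize`, and `mse` are hypothetical, the formats are modeled as sign/exponent/mantissa grids without inf/NaN codes, the scale is a simple per-tensor max-abs scale, and INT8 is treated as the special case with zero exponent bits and seven mantissa bits (a uniform symmetric grid). The data is synthetic.

```python
import numpy as np

def fp_grid(exp_bits, man_bits):
    """Non-negative values of a sign/exponent/mantissa format (assumes no inf/NaN codes).

    The exponent bias is fixed at 0 because the per-tensor scale in
    `quantize` absorbs it. exp_bits=0 yields a uniform (integer-like) grid.
    """
    mant = np.arange(2 ** man_bits) / 2 ** man_bits
    vals = [2.0 * mant]                       # subnormals (exponent code 0)
    for e in range(1, 2 ** exp_bits):         # normal numbers
        vals.append((1.0 + mant) * 2.0 ** e)
    return np.unique(np.concatenate(vals))    # sorted ascending

def quantize(x, exp_bits, man_bits):
    """Round x to the nearest grid point; the grid is scaled so its maximum is max|x|."""
    mag = np.abs(x)
    grid = fp_grid(exp_bits, man_bits)
    grid = grid * (mag.max() / grid.max())
    idx = np.clip(np.searchsorted(grid, mag), 1, len(grid) - 1)
    lo, hi = grid[idx - 1], grid[idx]
    return np.sign(x) * np.where(mag - lo <= hi - mag, lo, hi)

def mse(x, exp_bits, man_bits):
    """Mean squared quantization error of x for the given format."""
    return float(np.mean((x - quantize(x, exp_bits, man_bits)) ** 2))

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(100_000)
with_outlier = gaussian.copy()
with_outlier[0] = 1000.0                      # one extreme outlier stretches the range

# (exponent bits, mantissa bits); the sign bit is implicit in every format.
formats = {"INT8": (0, 7), "FP8-E2M5": (2, 5), "FP8-E3M4": (3, 4), "FP8-E4M3": (4, 3)}
results = {
    name: {"gaussian": mse(gaussian, e, m), "with_outlier": mse(with_outlier, e, m)}
    for name, (e, m) in formats.items()
}
for name, r in results.items():
    print(f"{name:9s} gaussian={r['gaussian']:.2e} with_outlier={r['with_outlier']:.2e}")
```

In this synthetic setup, the uniform grid has the lower error on the well-behaved Gaussian data, while the format with four exponent bits has the lower error once a single large outlier stretches the range. This shows only the representational trade-off; it says nothing about end-to-end network accuracy or hardware cost.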
