FP8 versus INT8 for efficient deep learning inference

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort
2023-06-15
Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of both the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware-training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.
Machine Learning
What problem does this paper attempt to address?
The paper aims to address the potential benefits and drawbacks of using the FP8 format for deep learning inference, particularly in comparison to the widely-used INT8 format. The primary focus is on determining whether FP8, with its floating-point representation, can offer significant advantages over INT8, especially in terms of efficiency and accuracy for inference tasks on edge devices.

### Key Points:

1. **Background and Context:**
   - Currently, most deep learning models are trained using FP32 or FP16 formats, and when deployed for inference on edge devices, they are often quantized to INT8 for efficiency.
   - There is growing interest in using FP8 for training, which could potentially simplify the deployment process by avoiding the quantization step.

2. **Hardware Efficiency:**
   - The paper demonstrates that FP8 is at least 50% less efficient in terms of area and energy usage compared to INT8 in dedicated hardware implementations. This means that FP8 would need to be significantly more accurate to justify its use over INT8.
   - Floating-point matrix multiplications are generally less efficient than integer ones, leading to higher hardware costs and potentially slower performance.

3. **Accuracy Comparison:**
   - Theoretical analysis shows that the main difference between FP8 and INT8 lies in their ability to handle outliers in the data. FP8 formats with more exponent bits (e.g., FP8-E4) can better represent outliers, while INT8 and FP8 formats with fewer exponent bits (e.g., FP8-E3) may struggle with this aspect.
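
The outlier argument in the accuracy comparison can be illustrated with a small numerical experiment. The sketch below is not the paper's evaluation code. It assumes a simplified FP8 format (1 sign bit, 4 exponent bits, 3 mantissa bits, subnormals, no codes reserved for inf/NaN), a symmetric INT8 grid, and a per-tensor scale chosen by a coarse search over clipping ranges. The helper names (`int8_grid`, `fp8_grid`, `quantize`, `best_mse`) and the synthetic data are hypothetical choices made for illustration only.

```python
import numpy as np

def int8_grid():
    # Symmetric signed 8-bit integer grid: -127 ... 127.
    return np.arange(-127, 128, dtype=np.float64)

def fp8_grid(exp_bits, man_bits):
    # Every value of a simplified sign/exponent/mantissa 8-bit format.
    # Assumption: subnormals are included and no codes are reserved for
    # inf/NaN, so this is not bit-exact with any hardware FP8 standard.
    assert 1 + exp_bits + man_bits == 8
    bias = 2 ** (exp_bits - 1)
    pos = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:  # subnormal range
                pos.add(frac * 2.0 ** (1 - bias))
            else:       # normal range
                pos.add((1.0 + frac) * 2.0 ** (e - bias))
    pos = np.array(sorted(pos))
    # Mirror the positive values (excluding zero) to get the negative half.
    return np.concatenate([-pos[:0:-1], pos])

def quantize(x, grid, scale):
    # Round each element of x to the nearest point of the scaled grid;
    # values outside the grid are clipped to its end points.
    g = grid * scale
    idx = np.clip(np.searchsorted(g, x), 1, len(g) - 1)
    lo, hi = g[idx - 1], g[idx]
    return np.where(x - lo <= hi - x, lo, hi)

def best_mse(x, grid):
    # Coarse search over clipping ranges; keep the lowest mean squared error.
    max_abs = np.abs(x).max()
    best = np.inf
    for frac in np.linspace(0.05, 1.0, 40):
        scale = frac * max_abs / grid.max()
        best = min(best, float(np.mean((x - quantize(x, grid, scale)) ** 2)))
    return best

rng = np.random.default_rng(0)
n = 100_000
gaussian = rng.standard_normal(n)

# Same data, but 0.1% of the entries are replaced by large outliers.
outliers = gaussian.copy()
where = rng.choice(n, size=100, replace=False)
outliers[where] = rng.uniform(100, 300, size=100) * rng.choice([-1.0, 1.0], size=100)

grids = {"int8": int8_grid(), "fp8_e4m3": fp8_grid(4, 3)}
results = {
    name: {fmt: best_mse(data, grid) for fmt, grid in grids.items()}
    for name, data in {"gaussian": gaussian, "outliers": outliers}.items()
}
for name, row in results.items():
    print(name, row)
```

Under these assumptions, the integer grid gives the lower error on the well-behaved Gaussian tensor, while the 4-exponent-bit format gives the lower error once a few large outliers stretch the range. This matches the qualitative claim above; it says nothing about end-to-end network accuracy or hardware cost.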