Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra
2024-07-05
Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and how they compare with integers in terms of accuracy and hardware cost, remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
Computer Vision and Pattern Recognition, Artificial Intelligence, Hardware Architecture, Machine Learning, Performance
What problem does this paper attempt to address?
This paper addresses how to compress machine-learning models further through post-training quantization (PTQ) while maintaining high accuracy when deploying on resource-constrained devices. Specifically, it explores quantizing with floating-point formats smaller than 8 bits (called minifloats) on FPGAs, to reduce a model's memory footprint, latency, and energy consumption while approaching the accuracy of the full-precision model. Existing research mainly focuses on 8-bit quantization; floating-point formats smaller than 8 bits, and how they compare with integer quantization, have not been fully explored. The paper aims to fill this gap by systematically analyzing minifloat and integer quantization under different precision configurations and evaluating their respective costs in hardware resource utilization.

### Main contributions:

1. **Proposed a new PTQ framework for low-precision minifloats**, covering a precision range from 3 to 8 bits.
2. **Implemented a custom-bit-width multiply-accumulate (MAC) operator library** supporting integer and minifloat MAC operations on FPGAs.
3. **Explored the trade-off between accuracy and hardware resource utilization in depth**, providing a detailed analysis of three vision models: ResNet-18, MobileNetV2, and ViT-B-32.

### Key findings:

- **For 3-bit weights and 3-to-4-bit activations**, integer quantization usually outperforms minifloats.
- **As the weight precision increases to 4 bits and above**, minifloats gradually overtake integer quantization. This is most pronounced in complex models such as ViT-B-32, where the larger dynamic range of minifloats better handles outliers in weight and activation distributions.
- **In the case of 4-bit weights and 8-bit activations**, both integer and minifloat representations can approach the accuracy of the full-precision model.
- **In terms of FPGA resource utilization**, integer quantization performs better when resources are very limited, but as the resource budget increases, minifloat and integer quantization gradually converge.

### Conclusion:

Through systematic analysis and experiments, the paper demonstrates the potential of minifloats for low-precision quantization, especially in handling complex models. These findings offer new ideas and technical support for the efficient deployment of deep-learning models on resource-constrained devices.
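
To make the dynamic-range point in the key findings concrete, the following is a minimal Python sketch of rounding a real value to a minifloat grid and to a uniform signed-integer grid. It is not the paper's implementation. The assumptions are all illustrative: one sign bit, an IEEE-style exponent bias of `2**(exp_bits - 1) - 1`, subnormals, no codes reserved for infinity or NaN, round-to-nearest-even, and saturation at the largest magnitude. The function names `minifloat_max`, `minifloat_min_positive`, `quantize_minifloat`, and `quantize_int` are hypothetical.

```python
import math


def _default_bias(exp_bits):
    # Assumed IEEE-style bias; a real format may choose a different one.
    return 2 ** (exp_bits - 1) - 1


def minifloat_max(exp_bits, man_bits, bias=None):
    """Largest finite magnitude, assuming no codes are reserved for inf/NaN."""
    bias = _default_bias(exp_bits) if bias is None else bias
    max_exp = (2 ** exp_bits - 1) - bias
    return (2.0 - 2.0 ** (-man_bits)) * 2.0 ** max_exp


def minifloat_min_positive(exp_bits, man_bits, bias=None):
    """Smallest positive magnitude (a subnormal when man_bits > 0)."""
    bias = _default_bias(exp_bits) if bias is None else bias
    return 2.0 ** ((1 - bias) - man_bits)


def quantize_minifloat(x, exp_bits, man_bits, bias=None):
    """Round x to the nearest value on the assumed minifloat grid."""
    bias = _default_bias(exp_bits) if bias is None else bias
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    # Saturate instead of overflowing to infinity.
    mag = min(abs(x), minifloat_max(exp_bits, man_bits, bias))
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    # Values below the smallest normal share the subnormal exponent 1 - bias.
    exp = max(e - 1, 1 - bias)
    step = 2.0 ** (exp - man_bits)  # grid spacing within this binade
    return sign * round(mag / step) * step  # round() is half-to-even


def quantize_int(x, bits, scale):
    """Uniform signed-integer quantization with a given scale, for contrast."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, round(x / scale))) * scale


if __name__ == "__main__":
    # 4-bit minifloat with 2 exponent bits and 1 mantissa bit vs. 4-bit integer.
    for v in (0.3, 2.4, 5.1, 100.0):
        print(v, quantize_minifloat(v, 2, 1), quantize_int(v, 4, 1.0))
```

Under these assumptions, the 4-bit format with 2 exponent bits and 1 mantissa bit represents the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so its largest value is 12 times its smallest nonzero value. A signed 4-bit integer grid is evenly spaced, with magnitudes from 1 to 8 in units of its scale. The non-uniform spacing is what lets a minifloat cover outliers without giving up resolution near zero, at the price of coarser steps for large values.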