Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their comparison with integers in terms of accuracy and hardware cost remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
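
To make the idea of a minifloat concrete, the following Python sketch enumerates every value a small signed floating-point format can represent and rounds a tensor onto that grid. It is an illustration under stated assumptions, not the paper's implementation: the function names `minifloat_grid` and `quantize_to_grid` are hypothetical, the exponent bias defaults to the IEEE-style value, subnormals are included, no encodings are reserved for infinity or NaN, and the scale is a single per-tensor absolute-maximum factor.

```python
import numpy as np


def minifloat_grid(exp_bits, man_bits, bias=None):
    """Return the sorted finite values of a signed minifloat format.

    Assumptions (not taken from the paper): IEEE-style default bias,
    subnormals enabled, no codes reserved for infinity or NaN.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1
    magnitudes = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal: no implicit leading one.
                magnitudes.append(frac * 2.0 ** (1 - bias))
            else:
                # Normal: implicit leading one.
                magnitudes.append((1.0 + frac) * 2.0 ** (e - bias))
    mags = np.unique(np.array(magnitudes))
    # Mirror for the sign bit; +0 and -0 collapse to one grid point.
    return np.unique(np.concatenate([-mags, mags]))


def quantize_to_grid(x, grid, scale):
    """Scale, clip to the grid range, and snap to the nearest grid value.

    Exact ties resolve to the lower grid value, which is simpler than
    round-to-nearest-even and is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=np.float64)
    scaled = np.clip(x / scale, grid[0], grid[-1])
    idx = np.abs(scaled[..., None] - grid).argmin(axis=-1)
    return grid[idx] * scale


# Usage: a 4-bit format with 1 sign, 2 exponent, and 1 mantissa bit.
grid = minifloat_grid(exp_bits=2, man_bits=1)
weights = np.array([0.02, -0.31, 0.75, -1.2])
scale = np.abs(weights).max() / grid[-1]  # per-tensor abs-max scaling
print(grid)
print(quantize_to_grid(weights, grid, scale))
```

Note that the grid is denser near zero and coarser near the maximum, which is the property that distinguishes a minifloat from a uniformly spaced integer grid of the same bit width.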

What problem does this paper attempt to address?

The paper addresses how to compress machine-learning models further through post-training quantization (PTQ) while maintaining high accuracy when deploying them on resource-constrained devices. Specifically, it explores quantization with floating-point formats smaller than 8 bits (called minifloats) on FPGAs to reduce a model's memory footprint, latency, and energy consumption while approaching the accuracy of the full-precision model. Existing research mainly focuses on 8-bit integer quantization; floating-point formats smaller than 8 bits and their performance relative to integer quantization have not been fully explored. The paper aims to fill this gap by systematically analyzing minifloat and integer quantization under different precision configurations and evaluating their advantages and disadvantages in hardware resource utilization.

### Main contributions:

1. **Proposed a new PTQ framework for low-precision minifloats**, covering a precision range from 3 to 8 bits.
2. **Implemented a custom bit-width multiply-accumulate (MAC) operator library** providing custom integer and minifloat MAC operations on FPGA.
3. **Explored in depth the trade-off between precision and hardware resource utilization**, providing a detailed analysis of three vision models: ResNet-18, MobileNetV2, and ViT-B-32.

### Key findings:

- **For 3-bit weights and 3- to 4-bit activations**, integer quantization usually outperforms minifloats.
- **As the weight precision increases to 4 bits and above**, minifloats gradually overtake integer quantization in accuracy. This is especially true for complex models such as ViT-B-32, where the larger dynamic range of minifloats better handles outliers in the weight and activation distributions.
- **With 4-bit weights and 8-bit activations**, both integer and minifloat representations can approach the accuracy of the full-precision model.
- **In terms of FPGA resource utilization**, integer quantization performs better when resources are very limited, but as the resource budget increases, the performance of minifloat and integer quantization gradually converges.

### Conclusion:

Through systematic analysis and experiments, the paper demonstrates the potential of minifloats in low-precision quantization, especially their advantages in handling complex models. These findings provide new ideas and technical support for the efficient deployment of deep-learning models on resource-constrained devices.
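
The dynamic-range argument in the key findings can be made concrete with a small calculation. The Python sketch below compares the ratio between the largest and the smallest nonzero representable magnitude for a signed integer and for a minifloat of the same total bit width. The helper names `int_dynamic_range` and `minifloat_dynamic_range` are hypothetical, and the minifloat is assumed to use an IEEE-style bias, subnormals, and no encodings reserved for infinity or NaN; the formats evaluated in the paper may differ in these details.

```python
def int_dynamic_range(bits):
    """Largest / smallest nonzero magnitude of a signed integer grid."""
    # Symmetric view: magnitudes run from 1 to 2**(bits - 1) - 1.
    return 2 ** (bits - 1) - 1


def minifloat_dynamic_range(exp_bits, man_bits):
    """Largest / smallest nonzero magnitude of a minifloat.

    Assumes an IEEE-style bias, subnormals, and no inf/NaN codes.
    """
    bias = 2 ** (exp_bits - 1) - 1
    # Largest value: all-ones exponent and all-ones mantissa.
    largest = (2.0 - 2.0 ** -man_bits) * 2.0 ** (2 ** exp_bits - 1 - bias)
    # Smallest nonzero value: one mantissa step at the minimum exponent.
    smallest = 2.0 ** -man_bits * 2.0 ** (1 - bias)
    return largest / smallest


# Usage: compare 4-bit and 8-bit formats (1 sign bit in each minifloat).
for label, ratio in [
    ("int4", int_dynamic_range(4)),
    ("minifloat 1-2-1", minifloat_dynamic_range(2, 1)),
    ("minifloat 1-3-0", minifloat_dynamic_range(3, 0)),
    ("int8", int_dynamic_range(8)),
    ("minifloat 1-4-3", minifloat_dynamic_range(4, 3)),
]:
    print(f"{label:>16}: {ratio:g}")
```

Under these assumptions the ratio grows with the number of exponent bits, so at a fixed bit width a minifloat trades uniform resolution for range. That trade is consistent with the summary's observation that minifloats cope better with outlier-heavy distributions while integers hold up at the very lowest precisions.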

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs
