Abstract: Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce the size of models and improve the overall numerical performance by exploiting native hardware. This paper presents a detailed performance characterization of the benefits of quantization techniques (mainly FP16/INT8 variants with static and dynamic schemes) using the MLPerf Edge Inference benchmarking methodology. The study is conducted on Intel x86 processors and a Raspberry Pi device with an ARM processor. The paper uses a number of DNN inference frameworks, including OpenVINO (for Intel CPUs only), TensorFlow Lite (TFLite), ONNX, and PyTorch, with MobileNetV2, VGG-19, and DenseNet-121. The single-stream, multi-stream, and offline scenarios of the MLPerf Edge Inference benchmarks are used for measuring latency and throughput in our experiments. Our evaluation reveals that OpenVINO and TFLite are the most optimized frameworks for Intel CPUs and the Raspberry Pi device, respectively. We observe no loss in accuracy except for the static quantization techniques. We also observe the benefits of using quantization for these optimized frameworks. For example, INT8-based quantized models deliver $3.3\times$ and $4\times$ better performance over FP32 using OpenVINO on an Intel CPU and TFLite on the Raspberry Pi device, respectively, for the MLPerf offline scenario. To the best of our knowledge, this paper is the first to characterize the impact of quantization for a range of DNN inference frameworks (including OpenVINO, TFLite, PyTorch, and ONNX) on Intel x86 processors and a Raspberry Pi device with an ARM processor using the MLPerf Edge Inference benchmark methodology.
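The abstract reports latency for the single-stream scenario and throughput for the offline scenario. As a rough illustration of how such numbers are typically derived, here is a minimal sketch that assumes the usual MLPerf convention of a 90th-percentile latency for single-stream and samples per second for offline. The function names and all input values are hypothetical and are not taken from the paper, and the multi-stream scenario is not modeled.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def single_stream_latency(latencies_ms, pct=90):
    """Tail latency over queries issued one at a time (assumed metric)."""
    return percentile(latencies_ms, pct)

def offline_throughput(num_samples, total_seconds):
    """Samples processed per second when all samples are available up front."""
    return num_samples / total_seconds

def speedup(baseline_throughput, optimized_throughput):
    """Ratio used for 'N times better than FP32' style comparisons."""
    return optimized_throughput / baseline_throughput

# Hypothetical per-query latencies in milliseconds (not measured data).
latencies_ms = [10, 12, 11, 13, 50, 12, 11, 10, 12, 11]

if __name__ == "__main__":
    print("p90 latency (ms):", single_stream_latency(latencies_ms))
    print("throughput (samples/s):", offline_throughput(1000, 4.0))
    print("speedup:", speedup(100.0, 250.0))
```

A tail percentile is used instead of the mean so that a single slow query (the 50 ms outlier here) does not dominate the reported latency.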

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to provide a detailed performance characterization of deep neural network (DNN) inference on edge devices using quantization techniques. Specifically, the paper focuses on the impact of different quantization methods (including static and dynamic schemes) such as FP16/INT8 on the size and overall numerical performance of DNN models. The study primarily uses the MLPerf Edge Inference benchmark and conducts experiments on Intel x86 processors and Raspberry Pi (with ARM processors).

### Main Research Content

1. **Application of Quantization Techniques**:
   - The paper explores how quantization techniques can be used to reduce the size of DNN models and improve their inference performance on edge devices.
   - It investigates different precision formats for quantization, such as FP16 and INT8, as well as static and dynamic quantization schemes.
2. **Experimental Platforms and Frameworks**:
   - Experiments were conducted on Intel x86 processors and Raspberry Pi (with ARM processors).
   - Multiple DNN inference frameworks were used, including OpenVINO (Intel CPU only), TensorFlow Lite (TFLite), ONNX, and PyTorch.
   - MobileNetV2, VGG-19, and DenseNet-121 were selected as the experimental models.
3. **Performance Evaluation Metrics**:
   - The study evaluated the latency and throughput of different quantization methods in single-stream, multi-stream, and offline scenarios.
   - It focused on analyzing changes in model accuracy and size.

### Main Findings

1. **Model Size and Accuracy**:
   - The size of models quantized to INT8 format was reduced by 75%, while models quantized to FP16 format were reduced by 50%.
   - Except for static quantization techniques (INT8-SQ), other quantization methods performed well in maintaining model accuracy.
2. **Performance Improvement**:
   - On Intel CPUs, INT8 quantized models using the OpenVINO framework showed a 3.3x performance improvement over FP32 models in MLPerf offline scenarios.
   - On Raspberry Pi devices, INT8 quantized models using the TFLite framework showed a 4x performance improvement over FP32 models in MLPerf offline scenarios.
3. **Framework Optimization**:
   - OpenVINO and TFLite performed best on Intel CPUs and Raspberry Pi devices, respectively.
   - Dynamic quantization (DQ) methods were slower than FP32 in all cases because they require runtime computation of scaling factors.

### Conclusion

This paper demonstrates the effectiveness of quantization techniques in DNN inference through detailed performance characterization, particularly in enhancing performance on edge devices. The study's results indicate that quantization techniques can significantly reduce model size and improve inference performance without substantially affecting model accuracy. These findings are crucial for promoting the deployment of AI models in resource-constrained environments.
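The size reductions and the static versus dynamic distinction in the findings above can be illustrated with a small sketch. It assumes symmetric per-tensor INT8 quantization, which is a common textbook scheme and not necessarily what OpenVINO, TFLite, ONNX, or PyTorch do internally. All function names and activation values are hypothetical. The clipping shown for the static case is only one possible source of accuracy loss, not the paper's stated explanation.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}

def size_reduction(fmt):
    """Fraction of storage saved relative to FP32."""
    return 1 - BYTES_PER_VALUE[fmt] / BYTES_PER_VALUE["FP32"]

def int8_scale(values):
    """Symmetric per-tensor scale mapping the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Round to the nearest step and clamp to the signed 8-bit range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

def max_error(values, q_values, scale):
    """Largest absolute round-trip error after quantize then dequantize."""
    restored = dequantize(q_values, scale)
    return max(abs(v - r) for v, r in zip(values, restored))

# Hypothetical activations: a calibration batch and a later runtime batch
# whose range is wider than anything seen during calibration.
calibration = [0.5, -1.0, 0.25, 0.75]
runtime = [2.0, -0.5, 1.0, 0.1]

# Static scheme: the scale is fixed ahead of time from calibration data,
# so no extra work is needed per inference, but out-of-range values clip.
static_scale = int8_scale(calibration)
static_q = quantize(runtime, static_scale)
static_err = max_error(runtime, static_q, static_scale)

# Dynamic scheme: the scale is recomputed from each incoming tensor,
# which costs an extra pass over the data at inference time.
dynamic_scale = int8_scale(runtime)
dynamic_q = quantize(runtime, dynamic_scale)
dynamic_err = max_error(runtime, dynamic_q, dynamic_scale)

if __name__ == "__main__":
    print("INT8 size reduction:", size_reduction("INT8"))
    print("FP16 size reduction:", size_reduction("FP16"))
    print("static max error:", static_err)
    print("dynamic max error:", dynamic_err)
```

The extra `int8_scale` call on every runtime tensor is the kind of per-inference overhead the summary attributes to dynamic quantization, while the 4-to-1 and 4-to-2 byte ratios account for the 75% and 50% size reductions.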

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

FP8 versus INT8 for efficient deep learning inference

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Development of Quantized DNN Library for Exact Hardware Emulation

Optimizing convolutional neural networks for IoT devices: performance and energy efficiency of quantization techniques

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

Performance Characterization of Containerized DNN Training and Inference on Edge Accelerators

DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference

Low Power Inference for On-Device Visual Recognition with a Quantization-Friendly Solution

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices

Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference

HAWQV3: Dyadic Neural Network Quantization

BenQ: Benchmarking Automated Quantization on Deep Neural Network Accelerators

Fully Integer-Based Quantization for Mobile Convolutional Neural Network Inference