Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

Hyunho Ahn, Tian Chen, Nawras Alnaasan, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar K. Panda
2023-03-09
Abstract: Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce the size of models and improve the overall numerical performance by exploiting native hardware. This paper attempts to conduct an elaborate performance characterization of the benefits of using quantization techniques -- mainly FP16/INT8 variants with static and dynamic schemes -- using the MLPerf Edge Inference benchmarking methodology. The study is conducted on Intel x86 processors and a Raspberry Pi device with an ARM processor. The paper uses a number of DNN inference frameworks, including OpenVINO (for Intel CPUs only), TensorFlow Lite (TFLite), ONNX, and PyTorch with MobileNetV2, VGG-19, and DenseNet-121. The single-stream, multi-stream, and offline scenarios of the MLPerf Edge Inference benchmarks are used for measuring latency and throughput in our experiments. Our evaluation reveals that OpenVINO and TFLite are the most optimized frameworks for Intel CPUs and the Raspberry Pi device, respectively. We observe no loss in accuracy except for the static quantization techniques. We also observe the benefits of using quantization for these optimized frameworks. For example, INT8-based quantized models deliver $3.3\times$ and $4\times$ better performance over FP32 using OpenVINO on an Intel CPU and TFLite on the Raspberry Pi device, respectively, for the MLPerf offline scenario. To the best of our knowledge, this paper is the first to present a characterization study of the impact of quantization for a range of DNN inference frameworks -- including OpenVINO, TFLite, PyTorch, and ONNX -- on Intel x86 processors and a Raspberry Pi device with an ARM processor using the MLPerf Edge Inference benchmark methodology.
Performance, Signal Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to provide a detailed performance characterization of deep neural network (DNN) inference on edge devices using quantization techniques. Specifically, it examines how different quantization formats (FP16 and INT8) and schemes (static and dynamic) affect the size and inference performance of DNN models. The study follows the MLPerf Edge Inference benchmarking methodology and runs experiments on Intel x86 processors and a Raspberry Pi with an ARM processor.

### Main Research Content

1. **Application of Quantization Techniques**:
   - The paper explores how quantization techniques can be used to reduce the size of DNN models and improve their inference performance on edge devices.
   - It investigates different precision formats for quantization, such as FP16 and INT8, as well as static and dynamic quantization schemes.
2. **Experimental Platforms and Frameworks**:
   - Experiments were conducted on Intel x86 processors and a Raspberry Pi with an ARM processor.
   - Multiple DNN inference frameworks were used, including OpenVINO (Intel CPU only), TensorFlow Lite (TFLite), ONNX, and PyTorch.
   - MobileNetV2, VGG-19, and DenseNet-121 were selected as the experimental models.
3. **Performance Evaluation Metrics**:
   - The study evaluated the latency and throughput of different quantization methods in the single-stream, multi-stream, and offline scenarios.
   - It also analyzed changes in model accuracy and size.

### Main Findings

1. **Model Size and Accuracy**:
   - Models quantized to INT8 were 75% smaller, and models quantized to FP16 were 50% smaller.
   - All quantization methods except static quantization (INT8-SQ) maintained model accuracy.
2. **Performance Improvement**:
   - On Intel CPUs, INT8 quantized models using the OpenVINO framework showed a 3.3x performance improvement over FP32 models in the MLPerf offline scenario.
   - On the Raspberry Pi, INT8 quantized models using the TFLite framework showed a 4x performance improvement over FP32 models in the MLPerf offline scenario.
3. **Framework Optimization**:
   - OpenVINO and TFLite performed best on Intel CPUs and the Raspberry Pi, respectively.
   - Dynamic quantization (DQ) methods were slower than FP32 in all cases because they require runtime computation of scaling factors.

### Conclusion

This paper demonstrates the effectiveness of quantization techniques in DNN inference through detailed performance characterization, particularly in enhancing performance on edge devices. The results indicate that quantization can significantly reduce model size and improve inference performance without substantially affecting model accuracy. These findings are relevant to deploying AI models in resource-constrained environments.
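### Illustrative Sketch

The size reductions and the static-versus-dynamic distinction summarized above can be illustrated with a small, framework-independent Python sketch. It is not the paper's code and does not reproduce how OpenVINO, TFLite, ONNX, or PyTorch implement quantization. It assumes a symmetric per-tensor INT8 scheme, and every name in it (`size_reduction`, `compute_scale`, `static_quantize`, `dynamic_quantize`) is hypothetical.

```python
import random

# Bytes needed to store one value in each numeric format.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}


def size_reduction(fmt, baseline="FP32"):
    """Fractional storage saving of `fmt` relative to `baseline`."""
    return 1.0 - BYTES_PER_ELEMENT[fmt] / BYTES_PER_ELEMENT[baseline]


def compute_scale(values, qmax=127):
    """Symmetric per-tensor scale (an assumed scheme): the largest
    magnitude in `values` is mapped onto the integer level `qmax`."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round each value to the nearest integer level, clamped to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer levels back to approximate real values."""
    return [q * scale for q in q_values]


# Static scheme: the scale is fixed ahead of time from calibration data.
# The calibration set here is synthetic and purely illustrative.
random.seed(0)
calibration = [random.uniform(-1.0, 1.0) for _ in range(1000)]
STATIC_SCALE = compute_scale(calibration)


def static_quantize(activations):
    """No per-input work, but values outside the calibrated range are clipped."""
    return quantize(activations, STATIC_SCALE), STATIC_SCALE


def dynamic_quantize(activations):
    """The scale is recomputed for every input, costing an extra pass at runtime."""
    scale = compute_scale(activations)
    return quantize(activations, scale), scale


if __name__ == "__main__":
    print("INT8 saving vs FP32:", size_reduction("INT8"))
    print("FP16 saving vs FP32:", size_reduction("FP16"))
    sample = [0.25, -3.0, 1.5]  # contains values outside the calibrated range
    for name, fn in (("static", static_quantize), ("dynamic", dynamic_quantize)):
        q, scale = fn(sample)
        print(name, q, [round(x, 3) for x in dequantize(q, scale)])
```

In this sketch, the per-element byte counts give the 75% and 50% savings directly. The dynamic path pays for a scan of every input to find its scale, which matches the summary's explanation for why DQ was slower than FP32. The static path skips that scan but clips anything outside the calibrated range, which is one plausible contributor to accuracy loss under this assumed scheme rather than a cause identified by the paper.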