Characterizing Parameter Scaling with Quantization for Deployment of CNNs on Real-Time Systems
Calvin B. Gealy, Alan D. George
DOI: https://doi.org/10.1145/3654799
2024-03-30
ACM Transactions on Embedded Computing Systems
Abstract: Modern deep-learning models tend to include billions of parameters, reducing real-time performance. Embedded systems are compute-constrained, yet they are frequently used to deploy these models in real-time systems because of size, weight, and power requirements. Tools like parameter-scaling methods help to shrink models to ease deployment. This research compares two scaling methods for convolutional neural networks, uniform scaling and NeuralScale, and analyzes their impact on inference latency, memory utilization, and power. Uniform scaling scales the number of filters evenly across a network. NeuralScale adaptively scales the model to theoretically achieve the highest accuracy for a target parameter count. In this study, VGG-11, MobileNetV2, and ResNet-50 models were scaled to four ratios: 0.25×, 0.50×, 0.75×, and 1.00×. These models were benchmarked on an ARM Cortex-A72 CPU, an NVIDIA Jetson AGX Xavier GPU, and a Xilinx ZCU104 FPGA. Additionally, quantization was applied to meet real-time objectives. The CIFAR-10 and tinyImageNet datasets were studied. On CIFAR-10, NeuralScale creates more computationally intensive models than uniform scaling for the same parameter count, with relative speeds of 41% on the CPU, 72% on the GPU, and 96% on the FPGA. The additional computational complexity is a tradeoff for accuracy improvements in the VGG-11 and MobileNetV2 NeuralScale models, but ResNet-50 NeuralScale accuracy is reduced. Furthermore, quantization alone achieves similar or better performance on the CPU and GPU devices when compared to models scaled to 0.50×, despite slight reductions in accuracy. On the GPU, quantization reduces latency by 2.7× and memory consumption by 4.3×. Uniform-scaling models are 1.8× faster and use 2.8× less memory. NeuralScale reduces latency by 1.3× and memory by 1.1×. We find quantization to be a practical first tool for improved performance. Uniform scaling can easily be applied for additional improvements. NeuralScale may improve accuracy but tends to negatively impact performance, so more care must be taken with it.
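As a rough illustration of the uniform scaling described in the abstract, the sketch below applies one ratio to every layer's filter count and reports the resulting convolutional parameter count. It is not the authors' implementation. The helper names (`scale_filters`, `conv_param_count`), the rounding rule, the assumption that the ratio applies to filter counts, and the simplification to a plain stack of 3×3 convolutions with biases (no batch normalization or classifier layers) are all assumptions made for this example. The filter list is the standard VGG-11 convolutional configuration.

```python
def scale_filters(filters, ratio):
    """Scale every layer's filter count by the same ratio (keep at least 1 filter)."""
    return [max(1, round(f * ratio)) for f in filters]


def conv_param_count(filters, in_channels=3, kernel=3):
    """Count weights and biases for a plain stack of kernel x kernel conv layers."""
    total = 0
    for out_channels in filters:
        # weights: kernel * kernel * in * out, plus one bias per output filter
        total += kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total


# Standard VGG-11 convolutional filter counts (assumed baseline for the sketch).
VGG11_FILTERS = [64, 128, 256, 256, 512, 512, 512, 512]

if __name__ == "__main__":
    for ratio in (0.25, 0.50, 0.75, 1.00):
        scaled = scale_filters(VGG11_FILTERS, ratio)
        print(ratio, scaled, conv_param_count(scaled))
```

Because each convolution's weight count depends on both its input and output channels, scaling the filter counts by a ratio shrinks the parameter count roughly quadratically in this simplified model.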
computer science, software engineering, hardware & architecture