Abstract: Modern deep-learning models often contain billions of parameters, which degrades real-time performance. Embedded systems are compute-constrained, yet they are frequently used to deploy these models in real-time systems because of size, weight, and power requirements. Tools such as parameter-scaling methods help shrink models to ease deployment. This research compares two scaling methods for convolutional neural networks, uniform scaling and NeuralScale, and analyzes their impact on inference latency, memory utilization, and power. Uniform scaling scales the number of filters evenly across a network. NeuralScale adaptively scales the model to theoretically achieve the highest accuracy for a target parameter count. In this study, VGG-11, MobileNetV2, and ResNet-50 models were scaled to four ratios: 0.25×, 0.50×, 0.75×, and 1.00×. These models were benchmarked on an ARM Cortex-A72 CPU, an NVIDIA Jetson AGX Xavier GPU, and a Xilinx ZCU104 FPGA. Additionally, quantization was applied to meet real-time objectives. The CIFAR-10 and tinyImageNet datasets were studied. On CIFAR-10, NeuralScale creates more computationally intensive models than uniform scaling for the same parameter count, with relative speeds of 41% on the CPU, 72% on the GPU, and 96% on the FPGA. The additional computational complexity is a tradeoff for accuracy improvements in the VGG-11 and MobileNetV2 NeuralScale models, but accuracy is reduced for the ResNet-50 NeuralScale model. Furthermore, quantization alone achieves similar or better performance on the CPU and GPU devices when compared to models scaled to 0.50×, despite slight reductions in accuracy. On the GPU, quantization reduces latency by 2.7× and memory consumption by 4.3×. Uniform-scaling models are 1.8× faster and use 2.8× less memory. NeuralScale reduces latency by 1.3× and memory consumption by 1.1×. We find quantization to be a practical first tool for improved performance. Uniform scaling can easily be applied for additional improvements. NeuralScale may improve accuracy but tends to negatively impact performance, so more care must be taken with it.
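
As a rough illustration of the uniform scaling described in the abstract, the sketch below applies a single ratio to every convolutional layer's filter count. It is not the authors' code. It assumes the ratio multiplies filter counts directly (the abstract does not say whether the ratios refer to filters or to parameters), it uses the standard VGG-11 layer widths, and the helper names `uniform_scale` and `conv_param_count` are hypothetical. Only 3×3 convolution weights and biases are counted; classifier layers and batch normalization are ignored.

```python
# Standard VGG-11 convolutional widths (output filters per conv layer).
VGG11_FILTERS = [64, 128, 256, 256, 512, 512, 512, 512]


def uniform_scale(filters, ratio):
    """Scale every layer's filter count by the same ratio (hypothetical helper).

    Assumption: the ratio is applied to filter counts, keeping at least one filter.
    """
    return [max(1, round(f * ratio)) for f in filters]


def conv_param_count(filters, in_channels=3, kernel_size=3):
    """Count weights and biases of a plain stack of k x k convolutions."""
    total = 0
    for out_channels in filters:
        # k*k*in*out weights plus one bias per output filter
        total += kernel_size * kernel_size * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total


full = conv_param_count(VGG11_FILTERS)
for ratio in (0.25, 0.50, 0.75, 1.00):
    scaled = uniform_scale(VGG11_FILTERS, ratio)
    params = conv_param_count(scaled)
    print(f"ratio={ratio:.2f} filters={scaled} conv_params={params} "
          f"fraction_of_full={params / full:.3f}")
```

Under these assumptions the convolutional parameter count shrinks roughly with the square of the ratio, because both the input and output channels of most layers are scaled. This is one reason a filter-count ratio and a parameter-count target are not interchangeable.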

HACScale: Hardware-Aware Compound Scaling for Resource-Efficient DNNs

NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks

Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic

AdaScale: Dynamic Context-aware DNN Scaling via Automated Adaptation Loop on Mobile Devices

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

ScaleNet: Searching for the Model to Scale.

Hardware-Aware Softmax Approximation for Deep Neural Networks

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Beyond Uniform Scaling: Exploring Depth Heterogeneity in Neural Architectures

HAO: Hardware-aware neural Architecture Optimization for Efficient Inference

Data-Driven Neuron Allocation for Scale Aggregation Networks

ScaleNAS: One-Shot Learning of Scale-Aware Representations for Visual Recognition

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Characterizing Parameter Scaling with Quantization for Deployment of CNNs on Real-Time Systems

Scale-Net: Learning to Reduce Scale Differences for Large-Scale Invariant Image Matching

A DNN Optimization Framework with Unlabeled Data for Efficient and Accurate Reconfigurable Hardware Inference

Scale Attention for Learning Deep Face Representation: A Study Against Visual Scale Variation

EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search

A High Utilization FPGA-Based Accelerator for Variable-Scale Convolutional Neural Network

HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks

HCM: Hardware-Aware Complexity Metric for Neural Network Architectures