Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices

Lei Xun, Jonathon Hare, Geoff V. Merrett

2024-01-17

Abstract: Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods have been proposed in previous work, at system runtime multiple applications are typically executed concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks that contain sub-networks with different performance trade-offs, or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that combines the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with the state of the art, our experimental results using ImageNet on the GPU of the Jetson Xavier NX show that our model is 2.4x faster at similar ImageNet Top-1 accuracy, or achieves 5.1% higher accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a deployment scenario with two concurrent models.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the efficient deployment of deep neural networks (DNNs) on mobile and embedded platforms. It targets two main challenges:

1. **Runtime Hardware Availability**: Modern system-on-chips (SoCs) combine components such as CPUs, GPUs, and NPUs, and the hardware resources available to a given model fluctuate at runtime as applications compete for them. These fluctuations make it difficult to meet performance targets consistently, because the resources that will be available at runtime are unknown when the model is compressed at design time.
2. **Runtime Application Variability**: A single DNN model can serve multiple application scenarios (such as translation, text generation, and chatbots), each with different performance requirements. For example, chatbots require low latency for quick responses, while translation and text generation focus more on accuracy. These requirements may also change at runtime due to user settings or preferences, so they cannot be fixed at design time.

To address these challenges, the authors propose a method that combines dynamic neural networks with runtime hardware resource management. The method includes:

- Developing a Dynamic Super-Network from which efficient sub-networks can be sampled directly from the backbone network, constructing dynamic neural networks without the need for retraining.
- Designing a hierarchical runtime resource manager that tunes both the dynamic neural network and DVFS in real time to meet dynamically changing application performance targets and hardware constraints.

On the GPU of the Jetson Xavier NX, the reported model is 2.4x faster than the state of the art at similar ImageNet Top-1 accuracy, or 5.1% more accurate at similar latency. Compared with the Linux DVFS governor schedutil, the runtime manager reduces energy by up to 19% and latency by up to 9% when a single model is deployed, and by 89% and 23% respectively when two models run concurrently.
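The idea of tuning the sub-network and the DVFS setting together can be illustrated with a minimal Python sketch. It is not the authors' hierarchical manager: it is a flat exhaustive search over a toy cost model, and every name and number in it (`SubNetwork`, `estimate`, `select_config`, the accuracy, cycle-count and power figures) is a hypothetical placeholder rather than a value from the paper. The sketch assumes latency scales as 1/f and dynamic power as f^3, then picks the lowest-energy configuration that satisfies a latency target and an accuracy floor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SubNetwork:
    name: str
    top1_accuracy: float  # percent; illustrative value, not from the paper
    mega_cycles: float    # compute per inference; illustrative value


@dataclass(frozen=True)
class Config:
    subnet: SubNetwork
    freq_mhz: int
    latency_ms: float
    energy_mj: float


def estimate(subnet: SubNetwork, freq_mhz: int,
             static_w: float = 1.0, dyn_w_per_ghz3: float = 4.0) -> Config:
    """Toy cost model (an assumption): latency ~ 1/f, dynamic power ~ f^3."""
    latency_ms = subnet.mega_cycles / freq_mhz * 1000.0
    power_w = static_w + dyn_w_per_ghz3 * (freq_mhz / 1000.0) ** 3
    # Watts multiplied by milliseconds gives millijoules.
    return Config(subnet, freq_mhz, latency_ms, power_w * latency_ms)


def select_config(subnets: Sequence[SubNetwork], freqs_mhz: Sequence[int],
                  latency_target_ms: float,
                  min_accuracy: float) -> Optional[Config]:
    """Return the lowest-energy (sub-network, frequency) pair that meets
    both the latency target and the accuracy floor, or None if none does."""
    feasible = [
        cfg
        for s in subnets if s.top1_accuracy >= min_accuracy
        for cfg in (estimate(s, f) for f in freqs_mhz)
        if cfg.latency_ms <= latency_target_ms
    ]
    return min(feasible, key=lambda c: c.energy_mj, default=None)


# Hypothetical sub-networks and GPU frequency levels.
SUBNETS = [
    SubNetwork("small", 70.0, 6.0),
    SubNetwork("medium", 75.0, 10.0),
    SubNetwork("large", 78.0, 16.0),
]
FREQS_MHZ = [400, 800, 1100]

if __name__ == "__main__":
    choice = select_config(SUBNETS, FREQS_MHZ,
                           latency_target_ms=30.0, min_accuracy=75.0)
    if choice is None:
        print("no feasible configuration")
    else:
        print(choice.subnet.name, choice.freq_mhz,
              round(choice.latency_ms, 2), round(choice.energy_mj, 2))
```

Calling `select_config` again whenever the latency target, the accuracy floor or the list of available frequencies changes mimics a runtime manager reacting to application variability and hardware availability. A real system would replace the toy cost model with measured latency and power profiles.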

Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

An Efficient and Flexible Learning Framework for Dynamic Power and Thermal Co-Management

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Resource-aware Deployment of Dynamic DNNs over Multi-tiered Interconnected Systems

Accelerate Intermittent Deep Inference

26ms Inference Time for ResNet-50: Towards Real-Time Execution of all DNNs on Smartphone

Enabling High Performance Deep Learning Networks on Embedded Systems

A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

Cavs: An Efficient Runtime System For Dynamic Neural Networks

HiTDL: High-Throughput Deep Learning Inference at the Hybrid Mobile Edge

Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization

AutoScale: Optimizing Energy Efficiency of End-to-End Edge Inference under Stochastic Variance

TensorRT-based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards

Thermal-Aware Scheduling for Deep Learning on Mobile Devices With NPU