Abstract: Deep neural network (DNN) inference is increasingly executed on mobile and embedded platforms because of its advantages in latency, privacy and always-on availability. However, limited computing resources make efficient DNN deployment on these platforms challenging. Although previous work has proposed many hardware accelerators and static model compression methods, at system runtime multiple applications typically execute concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous work has addressed these challenges through either dynamic neural networks, which contain sub-networks with different performance trade-offs, or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that combines the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with the state of the art (SOTA), our experimental results using ImageNet on the GPU of the Jetson Xavier NX show that our model is 2.4x faster at similar ImageNet Top-1 accuracy, or 5.1% more accurate at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and dynamic voltage and frequency scaling (DVFS) at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a deployment scenario with two concurrent models.

Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices