Abstract: Intelligent applications based on Deep Neural Networks (DNNs) have been intensively deployed on mobile devices. Unfortunately, resource-constrained mobile devices cannot meet stringent latency requirements because of the large amount of computation these applications require. Existing cloud-assisted and edge-assisted DNN inference approaches can reduce end-to-end inference latency by offloading DNN computations to the cloud server or edge servers, but they suffer either from unpredictable communication latency caused by massive data transmission over long wide-area links or from performance degradation caused by limited computation resources. In this paper, we propose an adaptive DNN inference acceleration framework, which accelerates DNN inference by fully utilizing end–edge–cloud collaborative computing. First, a latency prediction model is built to estimate the layer-wise execution latency of a DNN on heterogeneous computing platforms; it uses neural networks to learn non-linear features related to inference latency. Second, a computation partitioning algorithm is designed to identify two optimal partitioning points, which adaptively divide DNN computations among end devices, edge servers, and the cloud server to minimize DNN inference latency. Finally, we conduct extensive experiments on three widely adopted DNNs. The experimental results show that our latency prediction models improve prediction accuracy by about 72.31% on average compared with four baseline approaches, and that our computation partitioning approach reduces end-to-end latency by about 20.81% on average against six baseline approaches under three wireless networks.
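To make the idea of two partitioning points concrete, the following is a minimal sketch of an exhaustive search over all pairs of cut positions in a chain-structured DNN. It is not the algorithm from the paper, whose details the abstract does not give. The function name `best_two_cut_partition`, the cost model (per-layer latencies plus transfer time equal to data size divided by bandwidth, with the return of the final result ignored), and all numbers in the usage example are assumptions made for illustration. In practice the per-layer latencies would come from a latency prediction model.

```python
from itertools import accumulate


def best_two_cut_partition(dev, edge, cloud, out_bytes, input_bytes,
                           bw_dev_edge, bw_edge_cloud):
    """Exhaustively search two cut points (i, j) with 0 <= i <= j <= n.

    Layers [0, i) run on the end device, [i, j) on the edge server,
    and [j, n) on the cloud server. Returns (latency, i, j).

    Hypothetical cost model (an assumption, not taken from the paper):
      dev[k], edge[k], cloud[k]  latency of layer k on each platform (ms)
      out_bytes[k]               output size of layer k (bytes)
      input_bytes                size of the network input (bytes)
      bw_*                       link bandwidth (bytes per ms)
    """
    n = len(dev)
    # Prefix sums make the cost of any contiguous layer range O(1).
    pd = [0.0] + list(accumulate(dev))
    pe = [0.0] + list(accumulate(edge))
    pc = [0.0] + list(accumulate(cloud))
    # size[k] is the data that must cross a cut placed before layer k.
    size = [input_bytes] + list(out_bytes)

    best = None
    for i in range(n + 1):
        for j in range(i, n + 1):
            compute = pd[i] + (pe[j] - pe[i]) + (pc[n] - pc[j])
            # Data leaves the device only if some layer runs elsewhere.
            t_dev_edge = size[i] / bw_dev_edge if i < n else 0.0
            # Data reaches the cloud only if some layer runs there.
            t_edge_cloud = size[j] / bw_edge_cloud if j < n else 0.0
            total = compute + t_dev_edge + t_edge_cloud
            if best is None or total < best[0]:
                best = (total, i, j)
    return best


if __name__ == "__main__":
    # Made-up numbers for a three-layer network.
    latency, i, j = best_two_cut_partition(
        dev=[4.0, 4.0, 4.0],
        edge=[2.0, 2.0, 2.0],
        cloud=[1.0, 1.0, 0.5],
        out_bytes=[100, 10, 1],
        input_bytes=1000,
        bw_dev_edge=100.0,
        bw_edge_cloud=10.0,
    )
    print(f"device: layers [0, {i}), edge: [{i}, {j}), cloud: [{j}, 3)")
    print(f"estimated end-to-end latency: {latency} ms")
```

The search costs O(n^2) for n layers, which is cheap enough to rerun whenever the measured bandwidth changes. The degenerate cases i = j (no layers on the edge, which only relays data) and i = n (everything on the device) are included, so device-only, edge-only, and cloud-only execution are all candidates.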

Reducing Data Motion to Accelerate the Training of Deep Neural Networks

DaDianNao: A Machine-Learning Supercomputer

Reducing Data Bottlenecks in Distributed, Heterogeneous Neural Networks

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

Accelerating Data Loading in Deep Neural Network Training

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks

CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks

RedSync: Reducing Synchronization Bandwidth for Distributed Deep Learning Training System

STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage

Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Accelerating the Training of Artificial Neural Networks Using Data Parallelization

RedSync: Reducing Synchronization Traffic for Distributed Deep Learning

Accelerating neural network training with distributed asynchronous and selective optimization (DASO)

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Stitching Weight-Shared Deep Neural Networks for Efficient Multitask Inference on GPU

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Accelerating Distributed Deep Learning using Lossless Homomorphic Compression