Abstract: Deep neural networks (DNNs) are state-of-the-art solutions for many machine learning applications and are widely used on mobile devices. Running DNNs on resource-constrained mobile devices often requires help from edge servers via computation offloading. However, offloading over a bandwidth-limited wireless link is non-trivial because of the tight interplay between the computation resources of the mobile device and the wireless resources. Existing studies have focused on cooperative inference, in which DNN models are partitioned at different neural network layers and the two parts are executed on the mobile device and the edge server, respectively. Since the output of a DNN layer can be larger than the raw input data, offloading intermediate data between layers can suffer from high transmission latency under limited wireless bandwidth. In this paper, we propose an efficient and flexible 2-step pruning framework for DNN partitioning between mobile devices and edge servers. In our framework, the DNN model needs to be pruned only once, in the training phase, where unimportant convolutional filters are removed iteratively. By limiting the pruning region, our framework can greatly reduce either the wireless transmission workload of the device or the total computation workload. A series of pruned models is generated in the training phase, from which the framework automatically selects one to satisfy varying latency and accuracy requirements. Furthermore, coding of the intermediate data is added to further reduce the transmission workload. Our experiments show that the proposed framework achieves up to a 25.6$\times$ reduction in transmission workload, a 6.01$\times$ acceleration of total computation, and a 4.81$\times$ reduction in end-to-end latency compared with partitioning the original DNN model without pruning.
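The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a simple additive latency model (device computation, then wireless transmission, then server computation) and a "lowest latency among feasible candidates" policy; neither is stated in the abstract. The `Candidate` fields, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One pruned model with a fixed partition point (hypothetical fields)."""
    name: str
    accuracy: float       # validation accuracy in [0, 1]
    device_flops: float   # computation executed on the mobile device
    tx_bytes: float       # intermediate data sent over the wireless link
    server_flops: float   # computation executed on the edge server


def latency(c: Candidate, device_rate: float, server_rate: float,
            bandwidth: float) -> float:
    """Assumed additive model: device compute + transmission + server compute."""
    return (c.device_flops / device_rate
            + c.tx_bytes / bandwidth
            + c.server_flops / server_rate)


def select(candidates: List[Candidate], min_accuracy: float,
           max_latency: float, device_rate: float, server_rate: float,
           bandwidth: float) -> Optional[Candidate]:
    """Return the lowest-latency candidate meeting both requirements, or None."""
    feasible = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and latency(c, device_rate, server_rate, bandwidth) <= max_latency
    ]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: latency(c, device_rate, server_rate, bandwidth))


# Hypothetical candidates; the values are illustrative only.
models = [
    Candidate("original", 0.93, 1e8, 4e5, 5e8),
    Candidate("pruned_tx", 0.92, 1e8, 2e4, 5e8),       # transmission-oriented pruning
    Candidate("pruned_compute", 0.90, 2e7, 1e5, 1e8),  # computation-oriented pruning
]

# Assumed rates: 1 GFLOP/s device, 10 GFLOP/s server, 100 kB/s wireless link.
rates = dict(device_rate=1e9, server_rate=1e10, bandwidth=1e5)

best = select(models, min_accuracy=0.91, max_latency=1.0, **rates)
print(best.name if best else "no feasible model")
```

Under this assumed model, a low-bandwidth link makes the transmission term dominate, so a candidate pruned to shrink the intermediate data tends to win; with a faster link the computation terms would matter more.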

Adaptive Compression-Aware Split Learning and Inference for Enhanced Network Efficiency

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

On-Demand Deep Model Compression for Mobile Devices

Pruning at a Glance: Global Neural Pruning for Model Compression

Adaptive Deep Inference Framework for Cloud-Edge Collaboration

Understanding Sensor Data Using Deep Learning Methods on Resource-Constrained Edge Devices

Efficient Deep Learning Inference Based on Model Compression

Edge-PRUNE: Flexible Distributed Deep Learning Inference

Model Pruning-enabled Federated Split Learning for Resource-constrained Devices in Artificial Intelligence Empowered Edge Computing Environment

Split Learning in Wireless Networks: A Communication and Computation Adaptive Scheme

Cloud–Edge Collaborative Inference with Network Pruning

Improving Device-Edge Cooperative Inference of Deep Learning via 2-Step Pruning

Using Distillation to Improve Network Performance after Pruning and Quantization

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Enabling Deep Learning on Edge Devices through Filter Pruning and Knowledge Transfer

A Novel Deep Learning Model Compression Algorithm

FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing

A Novel Adaptive Gradient Compression Scheme: Reducing the Communication Overhead for Distributed Deep Learning in the Internet of Things

LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time

A Multi-task Supervised Compression Model for Split Computing

ACP: Adaptive Channel Pruning for Efficient Neural Networks