Abstract: Deep neural networks (DNNs) are state-of-the-art solutions for many machine learning applications and are widely used on mobile devices. Running DNNs on resource-constrained mobile devices often requires help from edge servers via computation offloading. However, offloading through a bandwidth-limited wireless link is non-trivial because of the tight interplay between the computation resources on mobile devices and wireless resources. Existing studies have focused on cooperative inference, where a DNN model is partitioned at a chosen layer and the two parts are executed at the mobile device and the edge server, respectively. Because the output of a DNN layer can be larger than the raw input data, offloading intermediate data between layers can suffer from high transmission latency under limited wireless bandwidth. In this paper, we propose an efficient and flexible 2-step pruning framework for DNN partitioning between mobile devices and edge servers. In our framework, the DNN model needs to be pruned only once, in the training phase, where unimportant convolutional filters are removed iteratively. By limiting the pruning region, our framework can greatly reduce either the wireless transmission workload of the device or the total computation workload. A series of pruned models is generated in the training phase, from which the framework automatically selects one that satisfies the current latency and accuracy requirements. Furthermore, coding of the intermediate data is added to further reduce the transmission workload. Our experiments show that the proposed framework can achieve up to a 25.6$\times$ reduction in transmission workload, a 6.01$\times$ acceleration in total computation, and a 4.81$\times$ reduction in end-to-end latency compared to partitioning the original DNN model without pruning.
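The selection step mentioned in the abstract (picking one model from a series of pruned models to meet latency and accuracy requirements) can be illustrated with a minimal sketch. This is not the paper's algorithm: the additive latency model (device compute + wireless transfer + server compute), the `PrunedModel` record, the `select_model` helper, and every number below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrunedModel:
    """One candidate pruned model (hypothetical, offline-profiled values)."""
    name: str
    accuracy: float    # validation accuracy in [0, 1]
    device_ms: float   # compute time of the device-side part, in ms
    server_ms: float   # compute time of the server-side part, in ms
    tx_bytes: int      # size of the intermediate data sent over the link


def end_to_end_ms(model: PrunedModel, bandwidth_bytes_per_s: float) -> float:
    """Assumed latency model: device compute + transfer + server compute."""
    tx_ms = 1000.0 * model.tx_bytes / bandwidth_bytes_per_s
    return model.device_ms + tx_ms + model.server_ms


def select_model(
    candidates: Sequence[PrunedModel],
    bandwidth_bytes_per_s: float,
    min_accuracy: float,
    max_latency_ms: float,
) -> Optional[PrunedModel]:
    """Return the lowest-latency candidate meeting both requirements, or None."""
    feasible = [
        m for m in candidates
        if m.accuracy >= min_accuracy
        and end_to_end_ms(m, bandwidth_bytes_per_s) <= max_latency_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda m: end_to_end_ms(m, bandwidth_bytes_per_s))


# Hypothetical candidates: heavier pruning lowers both latency and accuracy.
CANDIDATES = [
    PrunedModel("light-prune", 0.92, device_ms=20.0, server_ms=5.0, tx_bytes=400_000),
    PrunedModel("medium-prune", 0.90, device_ms=15.0, server_ms=5.0, tx_bytes=50_000),
    PrunedModel("heavy-prune", 0.80, device_ms=5.0, server_ms=3.0, tx_bytes=10_000),
]

choice = select_model(CANDIDATES, bandwidth_bytes_per_s=1_000_000,
                      min_accuracy=0.85, max_latency_ms=100.0)
print(choice.name if choice else "no feasible model")
```

Because the transfer term scales inversely with bandwidth, the same candidate list can yield a different choice as the wireless link changes, which is the kind of adaptation the abstract describes.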


Improving Device-Edge Cooperative Inference of Deep Learning via 2-Step Pruning
