Abstract: Deploying deep learning architectures on devices with low computational resources is challenging because of their large parameter counts and computational complexity. These heavy, complex architectures increase latency in real-time applications. However, splitting a deep architecture so that forward propagation is parallelized across subnets deployed on multiple low-resource devices, and then aggregating the predictions, may reduce latency while preserving performance. In this paper, we propose a novel deep learning architecture called Ensembled Parallel Networks (EnParaNets) that leverages network dissection, knowledge distillation, and ensemble learning to reduce inference time while maintaining, and in some cases exceeding, the baseline accuracy in real-time applications. The methodology involves splitting the original network into N equal-sized blocks, forming N Sub-ParaNets, one for each block, and enhancing their representations using (A) contrastive knowledge distillation combined with minimizing the Kullback–Leibler divergence between the logit distributions of the teacher and student networks, and (B) an L2 loss between the intermediate representations of the original network and the corresponding Sub-ParaNets. The predictive distributions of the Sub-ParaNets are then ensembled to form the final EnParaNet. The proposed EnParaNet outperforms the baseline models of seven diverse architectures (ResNet56, VGG_13, WRN_40_2, DenseNet, ResNeXt50, MobileNetv2, and ShuffleNetv2) in accuracy while significantly reducing inference time using training methods A and B. With training method A, EnParaNet improves the accuracy of ResNet56, VGG_13, WRN_40_2, MobileNetv2, DenseNet, ResNeXt50, and ShuffleNetv2 by 2.69%, 0.24%, 1.95%, 7.69%, 0.33%, 2.13%, and 3.12%, respectively, and reduces inference time by 45%, 24%, 47%, 31%, 33%, 32%, and 44%, respectively. With training method B, EnParaNet achieves accuracy improvements of 1.75%, 2.90%, 1.09%, 3.91%, and 1.66%, with inference time reductions of 50%, 42%, 49%, 48%, and 49%, respectively. Moreover, a comprehensive ablation study analyzes the performance of the proposed technique and highlights its effectiveness and challenges. We also evaluate the performance of EnParaNet on transferability and adversarial robustness tasks.
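
The loss terms and the aggregation step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the temperature value, the use of mean squared error as the L2 loss, and simple averaging of the Sub-ParaNet softmax outputs are all assumptions made for illustration, the function names are hypothetical, and the contrastive distillation term of method A is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn a list of logits into a probability distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=4.0, eps=1e-12):
    """KL(teacher || student) over temperature-softened distributions.
    The temperature of 4.0 is an assumed value, not taken from the paper."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def l2_feature_loss(teacher_features, student_features):
    """Mean squared difference between two flattened feature vectors
    (assumed form of the L2 loss on intermediate representations)."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_features, student_features)]
    return sum(squared) / len(squared)

def ensemble_predict(sub_net_logits):
    """Average the Sub-ParaNet predictive distributions (assumed aggregation)
    and return the averaged distribution with its arg-max class index."""
    probs = [softmax(logits) for logits in sub_net_logits]
    averaged = [sum(column) / len(probs) for column in zip(*probs)]
    best_class = max(range(len(averaged)), key=averaged.__getitem__)
    return averaged, best_class

# Toy example: one teacher output and two hypothetical Sub-ParaNet outputs.
teacher = [2.0, 0.5, -1.0]
sub_nets = [[1.5, 0.7, -0.5], [2.2, 0.1, -1.2]]

kl_terms = [distillation_kl(teacher, s) for s in sub_nets]
averaged, best_class = ensemble_predict(sub_nets)
```

In a setup like this, each Sub-ParaNet would be trained against the original network with its own loss, and only the cheap averaging step would run after the parallel forward passes at inference time.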

ESEN: Efficient GPU Sharing of Ensemble Neural Networks

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

ENLARGE: an Efficient SNN Simulation Framework on GPU Clusters

EnParaNet: a novel deep learning architecture for faster prediction using low-computational resource devices

Multi-user Co-inference with Batch Processing Capable Edge Server

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

A Deep Neural Networks ensemble workflow from hyperparameter search to inference leveraging GPU clusters

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

NeuE: Automated Neural Network Ensembles for Edge Intelligence

AEML: An Acceleration Engine for Multi-GPU Load-balancing in Distributed Heterogeneous Environment

Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT Environments

Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

Joint Configuration Optimization and GPU Allocation for Multi-Tenant Real-Time Video Analytics on Resource-Constrained Edge