Abstract: Multi-modal DNNs with multi-task heads are increasingly used in edge computing scenarios. These networks typically first process the inputs of each modality separately, then extract features for unified fusion, and finally feed the fused features into multiple task heads. Such networks are often used to estimate pose and navigate movement direction from multi-modal data collected by diverse sensors, and therefore require low inference latency. An edge device cluster with high-speed interconnects can serve such DNN workloads with scaled-out performance. To accelerate model inference on edge devices, previous researchers have proposed methods including model pruning and quantization. However, these methods fail to exploit the structural features of multi-modal DNNs with multi-task heads and may degrade the model’s prediction accuracy. Based on the intrinsic structure of multi-modal DNNs with multi-task heads, we propose Sub-model Parallelism to achieve scalable execution speedup. Sub-model Parallelism is a scale-out deployment method that first assigns the preprocessing of different modalities to different edge devices, then gathers their outputs on one device for modality feature fusion, and finally distributes the fused features to other devices responsible for the different task heads. In experiments on the BEVFusion network, we achieve an approximately 30% reduction in latency using two Jetson Orin devices connected by Remote Direct Memory Access (RDMA). Furthermore, we conduct a series of simulation experiments covering scale-out scenarios, which also show notable latency reductions. We hope that our method provides useful experience for the optimized scale-out deployment of large multi-modal DNNs with multi-task heads on multiple edge devices.
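The deployment flow described in the abstract (parallel modality branches, one fusion device, parallel task heads) can be illustrated with a simple latency model. This is a minimal sketch under stated assumptions, not the authors' method or measurements: it assumes every modality branch and every task head gets its own device, that each of the two cross-device hops costs a fixed `transfer_ms`, and that compute times do not change across devices. The names `Stage`, `single_device_latency`, `sub_model_parallel_latency`, and `placement`, and all timings, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One sub-model of the network with a hypothetical per-device compute time."""
    name: str
    compute_ms: float


def single_device_latency(branches, fusion, heads):
    """Baseline: one device runs every sub-model back to back."""
    return (sum(b.compute_ms for b in branches)
            + fusion.compute_ms
            + sum(h.compute_ms for h in heads))


def sub_model_parallel_latency(branches, fusion, heads, transfer_ms):
    """Branches run concurrently, one device fuses, heads run concurrently.

    Assumption: each of the two cross-device hops (branches -> fusion and
    fusion -> heads) adds a fixed transfer_ms.
    """
    branch_stage = max(b.compute_ms for b in branches)  # fusion waits for the slowest modality
    head_stage = max(h.compute_ms for h in heads)       # output is ready when the slowest head ends
    return branch_stage + transfer_ms + fusion.compute_ms + transfer_ms + head_stage


def placement(branches, heads):
    """Hypothetical assignment: one device per branch, one for fusion, one per head."""
    plan = {b.name: f"device{i}" for i, b in enumerate(branches)}
    plan["fusion"] = f"device{len(branches)}"
    for j, h in enumerate(heads):
        plan[h.name] = f"device{len(branches) + 1 + j}"
    return plan


# Illustrative numbers only; not measurements from BEVFusion or Jetson Orin.
branches = [Stage("camera", 40.0), Stage("lidar", 30.0)]
fusion = Stage("fusion", 10.0)
heads = [Stage("detection", 15.0), Stage("segmentation", 15.0)]

base = single_device_latency(branches, fusion, heads)
par = sub_model_parallel_latency(branches, fusion, heads, transfer_ms=2.0)
print(placement(branches, heads))
print(f"single device: {base} ms, sub-model parallel: {par} ms, "
      f"reduction: {1 - par / base:.0%}")
```

With these made-up timings the model gives 110 ms on one device versus 69 ms with sub-model parallelism. The gain shrinks as `transfer_ms` grows or as one branch or head dominates the others. A setup with fewer devices than sub-models, such as the two-device experiment above, has to co-locate some stages, which this sketch does not model.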

Sub-model Parallelism: A Scale-out Deployment Method for Large Multi-modal DNNs

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

Multi-Model Running Latency Optimization in an Edge Computing Paradigm

Accelerating Deep Learning Inference via Model Parallelism and Partial Computation Offloading

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors

Resource-efficient Parallel Split Learning in Heterogeneous Edge Computing

Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference

Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

A High-Performance Dataflow-Centric Optimization Framework for Deep Learning Inference on the Edge

Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices

Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

ParallelFusion