Sub-model Parallelism: A Scale-out Deployment Method for Large Multi-modal DNNs

Tianhao Huang, Lingyu Sun, Xiaofeng Hou, Xiaozhi Zhu, Xinfeng Xia, Yutong Wang, Mingxi Chen, Chao Li
DOI: https://doi.org/10.1109/ccgrid59990.2024.00081
2024-01-01
Abstract: We have witnessed increasing use of multi-modal DNNs with multi-task heads in edge computing scenarios. These networks typically process inputs of different modalities first, then extract features for unified fusion, and finally feed the fused features into multi-task heads. Such networks are often used to determine pose and navigate movement direction from multi-modal data obtained from diverse sensory equipment, and therefore require low inference latency. An edge device cluster with high-speed interconnection can be employed to support such DNN workloads for scaled-out performance. To accelerate model inference on edge devices, previous researchers have proposed methods including model pruning, quantization, etc. However, these methods fail to take advantage of the structural features of multi-modal DNNs with multi-task heads and may impair the model’s prediction accuracy. Based on the intrinsic structure of multi-modal DNNs with multi-task heads, we propose Sub-model Parallelism to achieve scalable execution speedup. Sub-model Parallelism is a scale-out deployment method that first assigns the preprocessing tasks of different modalities to different edge devices, then delivers their outputs to a single device for modality feature fusion, and finally distributes the fused features to other devices responsible for different task head computations. We run experiments on the BEVFusion network and achieve an approximately 30% reduction in latency using two Jetson Orin devices connected by Remote Direct Memory Access (RDMA). Furthermore, we conduct a series of simulation experiments to cover scale-out scenarios and also achieve a good level of latency reduction. We hope that our proposed method can provide valuable experience for the optimized scale-out deployment of large multi-modal DNNs with multi-task heads on multiple edge devices.
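The three stages described in the abstract (parallel per-modality processing, fusion on one device, parallel task heads) can be illustrated with a minimal Python sketch. This is not the authors' implementation: worker threads stand in for edge devices, no RDMA transfer is modeled, and every function name and toy computation below is a hypothetical placeholder rather than part of BEVFusion.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for per-modality encoders (not BEVFusion modules).
def encode_camera(frame):
    return [x * 2 for x in frame]

def encode_lidar(points):
    return [x + 1 for x in points]

def fuse(features):
    # Toy fusion: element-wise sum across the modality feature lists.
    return [sum(vals) for vals in zip(*features)]

# Hypothetical task heads operating on the fused features.
def detection_head(fused):
    return max(fused)

def segmentation_head(fused):
    return [v > 8 for v in fused]

def run_sub_model_parallel(inputs, encoders, heads):
    """Run encoders concurrently, fuse once, then run heads concurrently.

    Each worker thread is an assumed stand-in for one edge device.
    """
    workers = max(len(encoders), len(heads))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: one encoder per modality, submitted in parallel.
        enc_futures = [pool.submit(fn, inputs[name]) for name, fn in encoders.items()]
        features = [f.result() for f in enc_futures]
        # Stage 2: fusion happens in a single place.
        fused = fuse(features)
        # Stage 3: the fused features are handed to every task head in parallel.
        head_futures = {name: pool.submit(fn, fused) for name, fn in heads.items()}
        return {name: f.result() for name, f in head_futures.items()}

outputs = run_sub_model_parallel(
    inputs={"camera": [1, 2, 3], "lidar": [4, 5, 6]},
    encoders={"camera": encode_camera, "lidar": encode_lidar},
    heads={"detection": detection_head, "segmentation": segmentation_head},
)
print(outputs)
```

In a real deployment the hand-offs between stages would be network transfers between devices, so the achievable speedup depends on how transfer cost compares with the compute time saved by running sub-models side by side.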