Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Weiguang Pang, Xiantong Luo, Kailun Chen, Dong Ji, Lei Qiao, Wang Yi
DOI: https://doi.org/10.1016/j.sysarc.2023.102888
IF: 5.836
2023-04-28
Journal of Systems Architecture
Abstract: Deep Neural Networks (DNNs) are widely used in Cyber-Physical Systems (CPS), which often involve multiple DNN tasks with varying real-time requirements. These tasks must be deployed on a single embedded hardware platform with limited resources, such as an embedded GPU. Efficiently sharing one embedded GPU among multiple real-time DNN tasks is a complex challenge. Existing DNN frameworks (e.g., PyTorch and TensorFlow) focus on maximizing average performance and throughput on GPUs, but they lack scheduling mechanisms that account for multiple DNNs with different timing requirements. In this paper, we address this challenge by thoroughly examining and summarizing the scheduling rules for multiple kernels with different priorities in CUDA streams. Based on these rules, we design a framework that supports multi-DNN real-time inference and propose a method for allocating CUDA streams to DNN kernels that meets schedulability requirements while maximizing GPU resource utilization. The proposed approach is implemented on an NVIDIA Jetson AGX Xavier embedded GPU system and validated using several popular DNNs. The results show that it achieves shorter response times than several state-of-the-art methods.
Subject areas: computer science, software engineering, hardware & architecture