Aaron: Compile-time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU

Zhihe Zhao, Neiwen Ling, Nan Guan, Guoliang Xing
DOI: https://doi.org/10.1145/3560905.3568050
2022-01-01
Abstract: AI applications powered by deep learning increasingly run on edge devices. Meanwhile, many real-world IoT applications require multiple real-time tasks to run on the same device, for example, performing object tracking and image segmentation simultaneously on augmented reality glasses. However, current solutions cannot yet support such multi-tenant real-time DNN inference on edge devices. Techniques such as on-device model compression trade inference accuracy for speed, while traditional DNN compilers mainly focus on optimizing a single-tenant DNN model. To fill this gap, we propose Aaron, which leverages DNN compilation techniques to accelerate multi-DNN inference on edge GPUs through compile-time kernel adaptation with no accuracy loss. Aaron integrates both DNN graph and kernel optimization to maximize on-device parallelism and minimize the contention caused by concurrent inference.