TensorRT-based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards

Eunjin Jeong, Jangryul Kim, Soonhoi Ha
DOI: https://doi.org/10.1145/3508391
2022-01-26
ACM Transactions on Embedded Computing Systems
Abstract: As deep learning inference applications on embedded devices increase, such devices tend to be equipped with neural processing units (NPUs) in addition to a multi-core CPU and a GPU; the NVIDIA Jetson AGX Xavier is one example. For fast and efficient development of deep learning applications, NVIDIA provides TensorRT, an SDK for high-performance inference that includes an optimizer and a runtime delivering low latency and high throughput. Like most deep learning frameworks, TensorRT assumes that inference is executed on a single processing element, either the GPU or an NPU, not both. In this paper, we present a TensorRT-based framework that supports various optimization parameters, including multi-threading, pipelining, buffer assignment, and network duplication, to accelerate deep learning applications on NVIDIA Jetson embedded platforms with heterogeneous processors. Since the design space of allocating layers to diverse processing elements and optimizing the other parameters is huge, we devise a parameter optimization methodology that consists of a heuristic for balancing pipeline stages among heterogeneous processors and a fine-tuning process for optimizing the parameters. On nine real-life benchmarks, we achieve a 101% to 680% performance improvement and up to a 55% energy reduction over the baseline inference that uses the GPU only.
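To make the idea of "balancing pipeline stages among heterogeneous processors" concrete, the sketch below splits a linear chain of layers into two contiguous stages, one on the GPU and one on an NPU (the DLA on Jetson AGX Xavier), and picks the split that minimizes the slower stage, since that stage bounds steady-state pipeline throughput. This is not the authors' heuristic or fine-tuning process; it is an exhaustive two-stage search assumed here for illustration. The per-layer latencies and the function names `best_two_stage_split` and `best_pipeline` are hypothetical, and the model ignores inter-stage transfer cost, layers the NPU cannot run, and memory-bandwidth contention.

```python
from itertools import accumulate


def best_two_stage_split(lat_a, lat_b):
    """Split a layer chain into two contiguous pipeline stages.

    Layers [0, k) run on processor A and layers [k, n) on processor B.
    Returns (k, stage_a_ms, stage_b_ms) with k chosen to minimize the
    slower stage, which limits steady-state pipeline throughput.
    """
    if len(lat_a) != len(lat_b):
        raise ValueError("latency lists must have equal length")
    n = len(lat_a)
    # prefix_a[k] = total time of layers [0, k) on processor A
    prefix_a = [0.0] + list(accumulate(lat_a))
    # suffix_b[k] = total time of layers [k, n) on processor B
    suffix_b = list(accumulate(reversed(lat_b)))[::-1] + [0.0]
    best = None
    for k in range(n + 1):
        stage_a, stage_b = prefix_a[k], suffix_b[k]
        if best is None or max(stage_a, stage_b) < max(best[1], best[2]):
            best = (k, stage_a, stage_b)
    return best


def best_pipeline(gpu_ms, npu_ms):
    """Try both processor orders and keep the one with the smallest bottleneck."""
    k1, a1, b1 = best_two_stage_split(gpu_ms, npu_ms)
    k2, a2, b2 = best_two_stage_split(npu_ms, gpu_ms)
    candidates = [("GPU->NPU", k1, a1, b1), ("NPU->GPU", k2, a2, b2)]
    return min(candidates, key=lambda c: max(c[2], c[3]))


if __name__ == "__main__":
    # Hypothetical per-layer latencies in milliseconds (not from the paper).
    gpu = [2.0, 3.0, 1.0, 4.0]
    npu = [4.0, 5.0, 2.0, 6.0]
    order, k, first_ms, second_ms = best_pipeline(gpu, npu)
    print(order, k, first_ms, second_ms)  # GPU->NPU 3 6.0 6.0
    print(sum(gpu) / max(first_ms, second_ms))  # throughput gain vs. GPU only
```

With these made-up numbers, running the first three layers on the GPU and the last on the NPU gives a 6 ms bottleneck versus 10 ms for GPU-only execution, roughly a 1.67x throughput gain even though the NPU is slower per layer. A real allocator would also have to account for the factors this sketch ignores, which is where a fine-tuning step on the actual board becomes useful.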
computer science, software engineering, hardware & architecture