Real-Time Monocular Depth Estimation Merging Vision Transformers on Edge Devices for AIoT

Xihao Liu, Wei Wei, Cheng Liu, Yuyang Peng, Jinhao Huang, Jun Li
DOI: https://doi.org/10.1109/tim.2023.3264039
IF: 5.6
2023-04-14
IEEE Transactions on Instrumentation and Measurement
Abstract: Depth estimation is a prerequisite for building the 3-D perception capability of the artificial intelligence of things (AIoT). Real-time inference with extremely low computing resource consumption is critical on edge devices. However, most single-view depth estimation networks focus on improving accuracy when running on high-end GPUs, which conflicts with the real-time requirement on edge devices. To address this issue, this article proposes a novel encoder–decoder network for real-time monocular depth estimation on edge devices. The proposed network merges semantic information over the global field via an efficient transformer-based module to provide more object detail for depth assignment. The transformer-based module is integrated at the lowest-resolution level of the encoder–decoder architecture to greatly reduce the number of parameters of the vision transformer (ViT). In particular, we propose a novel patch convolutional layer for low-latency feature extraction in the encoder and an SConv5 layer for effective depth assignment in the decoder. The proposed network achieves an outstanding balance between accuracy and speed on the NYU Depth v2 dataset: a low root mean square error (RMSE) of 0.554 and a fast speed of 58.98 FPS on an NVIDIA Jetson Nano device with TensorRT optimization, outperforming most state-of-the-art real-time results.
engineering, electrical & electronic; instruments & instrumentation
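
The abstract reports accuracy as RMSE and speed in FPS. As a reading aid only, the sketch below shows how those two quantities are conventionally computed or converted. It is not the authors' evaluation code: the function names and toy depth values are hypothetical, and only the 58.98 FPS figure comes from the abstract.

```python
import math

def rmse(pred, target):
    """Root mean square error over paired depth values (same units for both)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_errors = ((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(sum(squared_errors) / len(pred))

def fps_to_latency_ms(fps):
    """Average per-frame latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps

# Toy depth values (hypothetical, not taken from NYU Depth v2).
pred = [1.0, 2.0, 3.0]
target = [1.0, 2.0, 5.0]
toy_rmse = rmse(pred, target)        # sqrt(4 / 3), about 1.155

# Throughput reported in the abstract for Jetson Nano with TensorRT.
latency = fps_to_latency_ms(58.98)   # about 16.95 ms per frame

print(f"toy RMSE: {toy_rmse:.3f}")
print(f"latency at 58.98 FPS: {latency:.2f} ms")
```

Read this way, 58.98 FPS corresponds to an average budget of roughly 17 ms per frame on the target device.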