Abstract: Monocular depth estimation (MDE) predicts pixel-level depth from a single image and plays a vital role in image sensing. MDE has progressed through the use of deep neural networks (DNNs). However, current MDE methods fail to provide satisfactory depth because they rarely model dependencies among convolution channels and ignore location relationships in DNNs. Additionally, they are commonly slow at inference on embedded devices because of their high computational complexity. To tackle these problems, we propose a novel encoder–decoder network (EDNet) for fast MDE inference on diverse embedded devices. Specifically, (1) we design an encoder that re-explores new features and then models nonlinear and dynamic dependencies among convolution channels using an attention mechanism; (2) we propose a decoder containing four plug-and-play blocks that, respectively, extract image features, model dependencies among convolution channels, learn location relationships, and adjust the channels; and (3) we optimize EDNet with inference engines to match MDE to different embedded system architectures. Experiments confirm that our RMSE (root mean square error) is at least 3.7% and 5.0% lower than that of state-of-the-art models on the NYU-Depth-v2 and KITTI datasets, respectively. The optimized EDNet simultaneously improves the accuracy, inference speed, and visualized results of MDE on different embedded devices.
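
For readers unfamiliar with the metric, the RMSE comparison reported in the abstract can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names `rmse` and `relative_reduction` are hypothetical, and skipping pixels with non-positive ground-truth depth is an assumption that reflects common practice on datasets with sparse depth maps.

```python
import math


def rmse(pred, target):
    """Root mean square error between predicted and ground-truth depths.

    Assumption: pixels whose ground-truth depth is <= 0 are treated as
    invalid (no measurement) and excluded from the average.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    squared_error = sum((p - t) ** 2 for p, t in pairs)
    return math.sqrt(squared_error / len(pairs))


def relative_reduction(baseline_rmse, new_rmse):
    """Fraction by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse
```

Under this reading, a statement such as "RMSE is 3.7% lower" corresponds to `relative_reduction(baseline_rmse, new_rmse)` being about 0.037, where both values are computed on the same test set.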

Deep Neural Networks with Attention Mechanism for Monocular Depth Estimation on Embedded Devices