MD-ST: Monocular Depth Estimation Based on Spatio-Temporal Correlation Features

Xuyang Meng,Chunxiao Fan,Yue Ming,Runqing Zhang,Panzi Zhao
DOI: https://doi.org/10.1007/978-3-030-60633-6_29
2020-01-01
Abstract:Monocular depth estimation methods based on deep learning have shown very promising results recently, most of which exploit deep convolutional neural networks (CNNs) with scene geometric constraints. However, the depth maps estimated by most existing methods still have problems such as unclear object contours and unsmooth depth gradients. In this paper, we propose a novel encoder-decoder network, named Monocular Depth estimation with Spatio-Temporal features (MD-ST), based on recurrent convolutional neural networks for monocular video depth estimation with spatio-temporal correlation features. Specifically, we put forward a novel encoder with convolutional long short-term memory (Conv-LSTM) structure for monocular depth estimation, which not only captures the spatial features of the scene but also focuses on collecting the temporal features from video sequences. In decoder, we learn four scales depth maps for multi-scale estimation to fine-tune the outputs. Additionally, in order to enhance and maintain the spatio-temporal consistency, we constraint our network with a flow consistency loss to penalize the errors between the estimated and ground-truth maps by learning residual flow vectors. Experiments conducted on the KITTI dataset demonstrate that the proposed MD-ST can effectively estimate scene depth maps, especially in dynamic scenes, which is superior to existing monocular depth estimation methods.
What problem does this paper attempt to address?