IterDepth: Iterative Residual Refinement for Outdoor Self-Supervised Multi-Frame Monocular Depth Estimation

Cheng Feng, Zhen Chen, Congxuan Zhang, Weiming Hu, Bing Li, Feng Lu
DOI: https://doi.org/10.1109/tcsvt.2023.3284479
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Self-supervised monocular depth estimation has long been a challenging task in computer vision, as it relies only on monocular or stereo video for supervision. To address this challenge, we propose a novel multi-frame monocular depth estimation method called IterDepth, which is based on an iterative residual refinement network. IterDepth extracts depth features from consecutive frames and computes a 3D cost volume measuring the difference between current features and previous features transformed by PoseCNN (pose estimation convolutional neural network). We reformulate depth prediction as a residual learning problem, revamping the dominant depth regression approach to enable high-accuracy multi-frame monocular depth estimation. Specifically, we design a gated recurrent depth fusion unit that seamlessly blends depth features from the cost volume, image features, and the depth prediction. The unit updates the hidden states and iteratively refines the depth map, achieving more accurate predictions than existing methods. Our experiments on the KITTI dataset demonstrate that IterDepth is $7\times$ faster in terms of FPS (frames per second) than the recent state-of-the-art DepthFormer model while delivering competitive performance. We also test IterDepth on the Cityscapes dataset to showcase its generalization capability in other real-world environments. Moreover, IterDepth can balance accuracy and computational efficiency by adjusting the number of refinement iterations, and it performs competitively with other CNN-based monocular depth estimation approaches. Source code is available at https://github.com/PCwenyue/IterDepth-TCSVT.
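To make the idea of iterative residual refinement concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a per-pixel GRU-style update with random placeholder weights; the names `ToyGatedRefiner` and `refine_depth`, the feature layout, and all dimensions are hypothetical and chosen only for illustration. The `feats` array stands in for the fused cost-volume and image features described in the abstract.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyGatedRefiner:
    """Per-pixel GRU-style unit. Weights are random placeholders (assumption),
    not trained parameters from the paper."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = hidden_dim + feat_dim + 1  # hidden state + features + depth
        self.Wz = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # update gate
        self.Wr = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # reset gate
        self.Wq = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # candidate state
        self.Wd = rng.normal(0.0, 0.1, (hidden_dim, 1))       # residual head

    def step(self, h, feats, depth):
        # Fuse the features with the current depth estimate.
        x = np.concatenate([feats, depth], axis=-1)
        hx = np.concatenate([h, x], axis=-1)
        z = sigmoid(hx @ self.Wz)
        r = sigmoid(hx @ self.Wr)
        q = np.tanh(np.concatenate([r * h, x], axis=-1) @ self.Wq)
        h_new = (1.0 - z) * h + z * q
        # Predict a depth residual rather than the depth itself.
        delta = h_new @ self.Wd
        return h_new, delta


def refine_depth(refiner, feats, init_depth, hidden_dim, iters):
    """Apply `iters` residual updates: depth <- depth + delta."""
    h = np.zeros(feats.shape[:-1] + (hidden_dim,))
    depth = init_depth
    history = []
    for _ in range(iters):
        h, delta = refiner.step(h, feats, depth)
        depth = depth + delta
        history.append(depth)
    return depth, history


# Usage on a tiny 4x6 "image" with 8 feature channels (all values synthetic).
rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 6, 8))
init_depth = np.ones((4, 6, 1))
refiner = ToyGatedRefiner(feat_dim=8, hidden_dim=16)
depth, history = refine_depth(refiner, feats, init_depth, hidden_dim=16, iters=3)
```

In this sketch the `iters` argument plays the role of the number of refinement iterations mentioned in the abstract: fewer iterations cost less computation, while more iterations give the unit more chances to correct the estimate.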