Self-Supervised Multi-Frame Monocular Depth Estimation for Dynamic Scenes

Guanghui Wu, Hao Liu, Longguang Wang, Kunhong Li, Yulan Guo, Zengping Chen
DOI: https://doi.org/10.1109/tcsvt.2023.3340948
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Self-supervised multi-frame depth estimation outperforms single-frame approaches by utilizing not only appearance information but also geometric information. A common practice for multi-frame methods is to employ feature-metric bundle adjustment (FBA) to refine a depth map initialized from the single-frame prior. However, FBA cannot always provide effective residual updates because its matching costs are unreliable: they are corrupted by thin texture, occlusion, and especially object motion. To tackle this problem, we propose a context-aware transformer (CAT) that refines the corrupted matching costs by leveraging spatial context information. Specifically, the CAT adaptively aggregates matching costs according to the spatial affinity inferred from local appearance context, and produces reliable contextual costs for FBA. Moreover, we design a motion-aware regularization loss to provide supervision for regions with moving objects, making the CAT competent for dynamic scenes. Extensive experiments and analyses on the KITTI and Cityscapes datasets demonstrate the effectiveness and superior generalization capability of our approach.
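The idea of aggregating matching costs according to a spatial affinity derived from local appearance can be illustrated with a small NumPy sketch. This is not the authors' CAT: it is a simplified stand-in that assumes a scaled dot-product affinity, a softmax over a square local window, and edge padding at the image border. The function name `aggregate_costs`, the `radius` parameter, and the array layouts are all hypothetical choices made for illustration.

```python
import numpy as np


def aggregate_costs(cost, feat, radius=1):
    """Affinity-weighted aggregation of a cost volume over a local window.

    cost: (D, H, W) matching costs for D depth hypotheses (assumed layout).
    feat: (C, H, W) appearance features used to infer spatial affinity.
    Returns an array of shape (D, H, W).
    """
    C, H, W = feat.shape
    pad = ((0, 0), (radius, radius), (radius, radius))
    cost_p = np.pad(cost, pad, mode="edge")
    feat_p = np.pad(feat, pad, mode="edge")

    logits, shifted = [], []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            # Neighbour features and costs at offset (dy - radius, dx - radius).
            nb_feat = feat_p[:, dy:dy + H, dx:dx + W]
            # Scaled dot-product similarity between centre and neighbour.
            logits.append((feat * nb_feat).sum(axis=0) / np.sqrt(C))
            shifted.append(cost_p[:, dy:dy + H, dx:dx + W])

    logits = np.stack(logits)                       # (K, H, W)
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)   # softmax over the window

    shifted = np.stack(shifted)                     # (K, D, H, W)
    # Each output cost is a convex combination of the neighbouring costs.
    return (weights[:, None] * shifted).sum(axis=0)
```

Because the weights at each pixel sum to one, a pixel whose own cost is corrupted (for example, on a moving object) borrows cost evidence from neighbours with similar appearance, which is the intuition the abstract describes. A learned transformer would replace the fixed dot-product affinity used here.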