Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation

Hailiang Zhao, Yongyi Kong, Chonghao Zhang, Haoji Zhang, Jiansen Zhao
DOI: https://doi.org/10.3390/ijgi13060193
IF: 3.4
2024-06-12
ISPRS International Journal of Geo-Information
Abstract: Recent studies on self-supervised monocular depth estimation have achieved promising results, which are mainly based on the joint optimization of depth and pose estimation via high-level photometric loss. However, how to learn the latent and beneficial task-specific geometry representation from videos is still far from being explored. To tackle this issue, we propose two novel schemes to learn more effective representation from monocular videos: (i) an Inter-task Attention Model (IAM) to learn the geometric correlation representation between the depth and pose learning networks to make structure and motion information mutually beneficial; (ii) a Spatial-Temporal Memory Module (STMM) to exploit long-range geometric context representation among consecutive frames both spatially and temporally. Systematic ablation studies are conducted to demonstrate the effectiveness of each component. Evaluations on KITTI show that our method outperforms current state-of-the-art techniques.
Geography, Physical; Remote Sensing; Computer Science, Information Systems
What problem does this paper attempt to address?
The problem this paper addresses is how to learn effective task-specific geometric representations from videos for self-supervised monocular depth estimation. Existing self-supervised methods mainly optimize depth and pose estimation jointly through a high-level photometric loss, but how to learn latent, beneficial task-specific geometric representations from videos has not been fully explored. To address this, the authors propose two modules:

1. **Inter-task Attention Model (IAM)**: Learns geometrically correlated representations between the depth and pose learning networks, so that structure and motion information benefit each other.
2. **Spatial-Temporal Memory Module (STMM)**: Exploits the spatial and temporal geometric context across consecutive frames to improve the accuracy of depth estimation.

### Background of the Paper

Understanding the 3D structure of a scene is an important topic in machine perception, with applications in autonomous driving, robotic vision, and virtual reality. A key challenge is extracting effective task-specific geometric representations from videos in order to obtain more accurate and reliable depth. Existing self-supervised monocular depth estimation methods have made significant progress by jointly optimizing depth and pose estimation through a high-level photometric loss, but they still underuse the geometric information available in videos.

### Main Contributions

1. **Inter-task Attention Model (IAM)**: A cross-task attention mechanism in which attention maps generated from depth features guide the pose network toward key regions, improving the accuracy of pose estimation. The authors describe this as the first attempt to exploit cross-task geometric correlations in self-supervised monocular depth estimation.
2. **Spatial-Temporal Memory Module (STMM)**: A memory module that uses the spatio-temporal geometric context across consecutive frames, so that historical information contributes to the depth estimate.
3. **Experimental Verification**: Comprehensive experiments on the KITTI dataset. The reported results show a relative gain of 6.6% over existing methods on the main evaluation metrics.

### Method Overview

- **Problem Definition**: The typical pipeline of self-supervised monocular depth estimation is based on perspective projection between consecutive frames. Given the predicted depth and camera transformation, the source frame is re-projected onto the target frame, and the re-projected image is computed with differentiable bilinear sampling.
- **Network Architecture**: The network has two main parts, one for depth estimation and one for pose estimation. The pose network is split into two branches that predict rotation and translation respectively. IAM learns geometrically correlated representations between the depth and pose tasks, while STMM exploits long-range geometric correlations across consecutive frames.
- **Inter-task Attention Model (IAM)**: Depth features are passed through average-pooling and max-pooling layers to generate attention maps, which guide the pose network to focus on key regions.
- **Spatial-Temporal Memory Module (STMM)**: A non-local network captures long-range context and improves the estimation of distant objects.

### Experimental Results

- **Depth Estimation Results**: On the KITTI dataset, the method outperforms existing self-supervised methods in single-frame inference, especially on moving objects, distant objects, and fine structures.
- **Generalization Ability**: Although trained only on KITTI, the method also generalizes well to unseen datasets such as Make3D and Cityscapes.
- **Ablation Study**: Ablation studies that remove individual components verify the effectiveness of IAM and STMM, especially for the estimation of distant objects.

In conclusion, by introducing IAM and STMM, the paper addresses the challenge of learning task-specific geometric representations from videos and significantly improves the performance of self-supervised monocular depth estimation.
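
The view-synthesis step described under Problem Definition can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `reproject`, `bilinear_sample`, and `photometric_l1` are hypothetical, border clamping is an assumed way of handling out-of-view pixels, and the plain L1 error stands in for whatever photometric loss the paper actually uses. A real training pipeline would use a differentiable framework rather than NumPy.

```python
import numpy as np


def reproject(depth, K, T):
    """Map every target pixel to (x, y) coordinates in the source frame.

    depth: (H, W) depth of the target frame
    K:     (3, 3) camera intrinsics
    T:     (4, 4) rigid transform from target camera to source camera
    Returns an (H, W, 2) array of source-frame pixel coordinates.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W, dtype=np.float64),
                         np.arange(H, dtype=np.float64))
    # Homogeneous pixel coordinates, flattened row by row: shape (3, H*W).
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Back-project pixels to 3D points in the target camera frame.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Move points into the source camera frame and project with K.
    src = K @ (T @ cam_h)[:3]
    z = np.clip(src[2], 1e-6, None)  # guard against division by zero
    uv = src[:2] / z
    return uv.T.reshape(H, W, 2)


def bilinear_sample(img, coords):
    """Sample a (H, W) image at float (x, y) coords, clamping to the border."""
    H, W = img.shape
    x = np.clip(coords[..., 0], 0, W - 1)
    y = np.clip(coords[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0
    wy = y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom


def photometric_l1(target, warped):
    """Mean absolute difference between the target and the warped source."""
    return float(np.mean(np.abs(target - warped)))
```

As a usage example, `bilinear_sample(source_img, reproject(depth, K, T))` yields the source frame warped into the target view, and `photometric_l1(target_img, warped)` gives the error signal that, in a differentiable setting, would drive the depth and pose networks jointly.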