Abstract: Recent studies on self-supervised monocular depth estimation have achieved promising results, which are mainly based on the joint optimization of depth and pose estimation via high-level photometric loss. However, how to learn the latent and beneficial task-specific geometry representation from videos is still far from being explored. To tackle this issue, we propose two novel schemes to learn more effective representation from monocular videos: (i) an Inter-task Attention Model (IAM) to learn the geometric correlation representation between the depth and pose learning networks to make structure and motion information mutually beneficial; (ii) a Spatial-Temporal Memory Module (STMM) to exploit long-range geometric context representation among consecutive frames both spatially and temporally. Systematic ablation studies are conducted to demonstrate the effectiveness of each component. Evaluations on KITTI show that our method outperforms current state-of-the-art techniques.

What problem does this paper attempt to address?

The problem this paper addresses is how to learn effective task-specific geometric representations from videos in self-supervised monocular depth estimation. Existing self-supervised methods mainly optimize depth and pose estimation jointly through a high-level photometric loss, but how to learn latent, beneficial task-specific geometric representations from videos has not been fully explored. To address this challenge, the authors propose two novel methods:

1. **Inter-task Attention Model (IAM)**: Learns geometrically correlated representations between the depth and pose learning networks, so that structure and motion information benefit from each other.
2. **Spatial-Temporal Memory Module (STMM)**: Uses the spatial and temporal geometric context between consecutive frames to improve the accuracy of depth estimation.

### Background of the Paper

Understanding the 3D structure of a scene is an important topic in machine perception and is widely used in fields such as autonomous driving, robotic vision, and virtual reality. A key challenge is extracting effective task-specific geometric representations from videos in order to obtain more accurate and reliable depth information. Although existing self-supervised monocular depth estimation methods have made significant progress by jointly optimizing depth and pose estimation through a high-level photometric loss, they still fall short in exploiting the geometric information available in videos.

### Main Contributions

1. **Inter-task Attention Model (IAM)**: A cross-task attention mechanism is proposed. Attention maps generated from depth information guide the pose network toward key regions, thereby improving the accuracy of pose estimation. This is the first attempt to utilize cross-task geometric correlations in self-supervised monocular depth estimation.
2. **Spatial-Temporal Memory Module (STMM)**: A spatial-temporal memory module is introduced. By using the spatio-temporal geometric context between consecutive frames, historical information is exploited to improve depth estimation results.
3. **Experimental Verification**: Comprehensive experiments were carried out on the KITTI dataset. The results show that the method achieves a relative gain of 6.6% over existing methods on the main evaluation metrics.

### Method Overview

- **Problem Definition**: The typical pipeline of self-supervised monocular depth estimation is based on perspective projection between consecutive frames. Given the predicted depth and camera transformation, the source frame can be re-projected onto the target frame, with the warped image computed by differentiable bilinear sampling.
- **Network Architecture**: The network consists of two main parts, one for depth estimation and one for pose estimation. The pose network is divided into two branches that predict rotation and translation respectively. IAM learns geometrically correlated representations between the depth and pose tasks, while STMM exploits long-range geometric correlations between consecutive frames.
- **Inter-task Attention Model (IAM)**: Depth features are passed through average-pooling and max-pooling layers to generate attention maps, which guide the pose network to focus on key regions.
- **Spatial-Temporal Memory Module (STMM)**: A non-local network is used to capture long-range context information and improve the estimation of distant objects.

### Experimental Results

- **Depth Estimation Results**: Results on the KITTI dataset show that the method outperforms existing self-supervised methods in single-frame inference, especially on moving objects, distant objects, and fine structures.
- **Generalization Ability**: Although trained only on the KITTI dataset, the method also generalizes well to unseen datasets such as Make3D and Cityscapes.
- **Ablation Study**: Ablation studies that remove individual components verify the effectiveness of IAM and STMM, especially for the estimation of distant objects.

In conclusion, by introducing IAM and STMM, the paper addresses the challenge of learning task-specific geometric representations from videos and significantly improves the performance of self-supervised monocular depth estimation.
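The re-projection step described under Problem Definition can be illustrated with a small NumPy sketch. It follows the standard pinhole relation, in which a target pixel is back-projected with its depth, moved by the relative camera pose, and projected into the source frame, and the source image is then bilinearly sampled at those coordinates. The function names (`reproject`, `bilinear_sample`, `photometric_l1`), the single-channel images, the border clamping, and the plain L1 error are assumptions made for illustration; the summary does not give the paper's exact loss terms, and a real training pipeline would use a differentiable framework rather than NumPy.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to (x, y) coordinates in the source frame.

    depth: (H, W) target-frame depth map
    K:     (3, 3) camera intrinsics
    T:     (4, 4) relative pose from target camera to source camera
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1).astype(np.float64)
    # Back-project to 3D points in the target camera frame
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Move to the source camera frame and project with the intrinsics
    src = K @ (T @ cam_h)[:3]
    src_xy = src[:2] / np.clip(src[2:3], 1e-6, None)
    return src_xy.T.reshape(H, W, 2)

def bilinear_sample(img, coords):
    """Bilinearly sample a (H, W) image at float coords (H, W, 2), clamping at borders."""
    H, W = img.shape
    x = np.clip(coords[..., 0], 0, W - 1)
    y = np.clip(coords[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def photometric_l1(target, source, depth, K, T):
    """Mean absolute error between the target and the source warped into the target view."""
    warped = bilinear_sample(source, reproject(depth, K, T))
    return np.abs(target - warped).mean()
```

With an identity pose the warped source equals the source itself, so the error between a frame and itself is zero; a sideways camera translation shifts the sampling grid by focal length times translation divided by depth.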
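The IAM description above (depth features passed through average pooling and max pooling to form an attention map for the pose network) can be sketched as follows. This is only one plausible reading: the summary does not specify the pooling axis, how the two pooled maps are combined, or how the attention is applied. The sketch assumes pooling across channels, a hypothetical two-weight mixing (`w`, `b`) followed by a sigmoid, and element-wise gating of the pose features; the function name `inter_task_attention` is likewise illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inter_task_attention(depth_feat, pose_feat, w=(0.5, 0.5), b=0.0):
    """Gate pose features with a spatial attention map derived from depth features.

    depth_feat: (C_d, H, W) features from the depth network
    pose_feat:  (C_p, H, W) features from the pose network
    w, b:       hypothetical mixing weights and bias (learned in a real model)
    """
    avg_map = depth_feat.mean(axis=0)   # (H, W) average over channels
    max_map = depth_feat.max(axis=0)    # (H, W) maximum over channels
    # Combine the two pooled maps and squash to (0, 1)
    attn = sigmoid(w[0] * avg_map + w[1] * max_map + b)
    # Broadcast the single-channel attention over all pose channels
    return pose_feat * attn[None, :, :], attn
```

The attention map has one value per spatial location, so regions where the depth features respond strongly pass more of the pose features through, which matches the stated intent of guiding the pose network toward key regions.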

Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation

Monocular Depth Estimation Based on Unsupervised Learning

Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding

Self-supervised Learning of Monocular 3D Geometry Understanding with Two- and Three-View Geometric Constraints

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video

Geometry-Aware Network for Unsupervised Learning of Monocular Camera's Ego-Motion

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

Temporal-Aware SfM-Learner: Unsupervised Learning Monocular Depth and Motion from Stereo Video Clips.

Unsupervised Learning of Depth and Ego-Motion with Spatial-Temporal Geometric Constraints

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

Self-Supervised 3D Reconstruction and Ego-Motion Estimation Via On-Board Monocular Video

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Adversarial Learning for Joint Optimization of Depth and Ego-Motion

Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation

Unsupervised Learning of Depth, Optical Flow and Pose With Occlusion From 3D Geometry

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering

Collaborative Learning of Depth Estimation, Visual Odometry and Camera Relocalization from Monocular Videos.