Abstract: Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments such as autonomous driving scenes. However, depth prediction results remain unsatisfactory in indoor scenes, where most existing data are captured with hand-held devices. Compared with outdoor environments, self-supervised depth estimation from monocular indoor videos faces two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues for training, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences recorded with hand-held devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses, whereas the motion in outdoor sequences is predominantly translational, especially in street-scene driving datasets such as KITTI. In this work, we propose a novel framework, MonoIndoor++, that gives special consideration to these challenges and consolidates a set of good practices for improving self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly; the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks.
The proposed method is validated on a variety of benchmark indoor datasets, namely EuRoC MAV, NYUv2, ScanNet and 7-Scenes, demonstrating state-of-the-art performance. In addition, the effectiveness of each module is shown through a carefully conducted ablation study, and the good generalization and universality of our trained model are also demonstrated, specifically on the ScanNet and 7-Scenes datasets.
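
The three components named in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. It assumes that the final depth is a relative depth map multiplied by the global scale factor, that residual poses are 4x4 homogeneous transforms left-multiplied onto the current estimate, and that coordinate encoding means appending normalized x and y channels to the pose network input. All function names are hypothetical, and the learned networks are replaced by fixed values.

```python
import math


def add_coord_channels(image):
    """Append normalized x and y coordinate channels (range [-1, 1]).

    `image` is a nested list shaped [channels][height][width]. Appending
    coordinate channels is an assumed, CoordConv-style reading of
    "coordinate convolutional encoding".
    """
    height, width = len(image[0]), len(image[0][0])

    def norm(i, n):
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    xs = [[norm(x, width) for x in range(width)] for _ in range(height)]
    ys = [[norm(y, height) for _ in range(width)] for y in range(height)]
    return image + [xs, ys]


def factorized_depth(relative_depth, scale):
    """Assumed factorization: depth = global scale * relative depth."""
    return [[scale * d for d in row] for row in relative_depth]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def compose_residual_poses(initial_pose, residual_poses):
    """Refine a pose by left-multiplying each residual transform in turn.

    The composition order is an assumption made for this sketch.
    """
    pose = initial_pose
    for residual in residual_poses:
        pose = matmul4(residual, pose)
    return pose


def yaw_translation(yaw, tx):
    """Helper: rotation about the z axis by `yaw` plus a shift `tx` along x."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


# Usage: an initial pose estimate refined by one residual pose.
refined = compose_residual_poses(yaw_translation(0.1, 1.0),
                                 [yaw_translation(0.2, 0.5)])
```

In a real pipeline the residual poses would come from a pose network applied to the source image re-warped with the current pose estimate, so each step only has to explain the motion that earlier steps missed.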

IterDepth: Iterative Residual Refinement for Outdoor Self-Supervised Multi-Frame Monocular Depth Estimation

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Unsupervised Learning

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Self-Supervised Multi-Frame Monocular Depth Estimation for Dynamic Scenes

An Adaptive Unsupervised Learning Framework For Monocular Depth Estimation

FusionDepth: Complement Self-Supervised Monocular Depth Estimation with Cost Volume

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes

Self-Supervised Monocular Depth Estimation Based on High-Order Spatial Interactions

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

Self-supervised Monocular Depth Estimation with Multi-Scale Structure Similarity Loss

Unsupervised Framework for Depth Estimation and Camera Motion Prediction from Video

MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Unsupervised detail-preserving network for high quality monocular depth estimation

Disentangling Object Motion and Occlusion for Unsupervised Multi-frame Monocular Depth

Exploring the Mutual Influence between Self-Supervised Single-Frame and Multi-Frame Depth Estimation

TSUDepth: Exploring Temporal Symmetry-Based Uncertainty for Unsupervised Monocular Depth Estimation

A Self-Supervised Monocular Depth Estimation Method Based on High Resolution Convolutional Neural Network

Towards Scale-Aware Self-Supervised Multi-Frame Depth Estimation with IMU Motion Dynamics

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement