Abstract: Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments such as autonomous driving scenes. However, depth prediction results remain unsatisfactory in indoor scenes, where most of the existing data are captured with hand-held devices. Compared to outdoor environments, self-supervised depth estimation from monocular videos of indoor environments poses two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues for training, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences recorded with hand-held devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses, while the motion in outdoor sequences is predominantly translational, especially in street-scene driving datasets such as KITTI. In this work, we propose a novel framework, MonoIndoor++, that gives special consideration to these challenges and consolidates a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly; the predicted scale factor indicates the maximum depth value. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks.
The proposed method is validated on a variety of indoor benchmark datasets, namely EuRoC MAV, NYUv2, ScanNet and 7-Scenes, and demonstrates state-of-the-art performance. In addition, the effectiveness of each module is shown through a carefully conducted ablation study, and the good generalization and universality of our trained model are also demonstrated, specifically on the ScanNet and 7-Scenes datasets.
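
The three components named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so everything here is an assumption made for illustration, including the function names (`factorize_depth`, `add_coord_channels`, `compose_poses`, `translation`), the choice to model absolute depth as a global scale times a relative depth map in [0, 1], the normalization of coordinates to [-1, 1], and the left-multiplication order used to chain residual poses. The learned networks are replaced by fixed example values.

```python
import numpy as np


def factorize_depth(relative_depth, scale):
    """Assumed factorization: absolute depth = global scale * relative depth in [0, 1]."""
    return scale * np.asarray(relative_depth, dtype=np.float32)


def add_coord_channels(image):
    """Append normalized x and y coordinate channels to an (H, W, C) image.

    Coordinates run from -1 to 1 (an assumed normalization), giving the
    downstream pose network explicit positional information per pixel.
    """
    image = np.asarray(image, dtype=np.float32)
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h, dtype=np.float32),
        np.linspace(-1.0, 1.0, w, dtype=np.float32),
        indexing="ij",
    )
    return np.concatenate([image, xs[..., None], ys[..., None]], axis=-1)


def compose_poses(initial_pose, residual_poses):
    """Chain an initial 4x4 pose with residual 4x4 poses, one stage at a time.

    Each residual is applied by left-multiplication (an assumed convention).
    """
    pose = np.array(initial_pose, dtype=np.float64)
    for residual in residual_poses:
        pose = np.asarray(residual, dtype=np.float64) @ pose
    return pose


def translation(tx, ty, tz):
    """Hypothetical helper: a 4x4 homogeneous transform with translation only."""
    pose = np.eye(4)
    pose[:3, 3] = [tx, ty, tz]
    return pose


# Toy values standing in for network predictions.
absolute = factorize_depth([[0.25, 1.0]], scale=4.0)
encoded = add_coord_channels(np.zeros((4, 6, 3)))
final_pose = compose_poses(
    translation(1.0, 0.0, 0.0),
    [translation(0.5, 0.0, 0.0), translation(0.1, 0.0, 0.0)],
)
```

Under the assumed factorization, the largest absolute depth equals the scale factor whenever the relative map reaches 1, which matches the abstract's remark that the scale factor indicates the maximum depth. In the pose sketch, each residual stage corrects whatever error the previous stages left, so the final pose is the product of all stages.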

MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Deeper into Self-Supervised Monocular Indoor Depth Estimation

GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes

RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes

PMIndoor: Pose Rectified Network and Multiple Loss Functions for Self-Supervised Monocular Indoor Depth Estimation

SIM-MultiDepth: Self-Supervised Indoor Monocular Multi-Frame Depth Estimation Based on Texture-Aware Masking

StructDepth: Leveraging the Structural Regularities for Self-Supervised Indoor Depth Estimation

IterDepth: Iterative Residual Refinement for Outdoor Self-Supervised Multi-Frame Monocular Depth Estimation

Unsupervised Learning of Monocular Depth and Ego-motion in Outdoor/Indoor Environments

A Monocular Depth Estimation Method for Indoor-Outdoor Scenes Based on Vision Transformer

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

Indoor Scene Reconstruction From Monocular Video Combining Contextual and Geometric Priors

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

Unsupervised Framework for Depth Estimation and Camera Motion Prediction from Video

MonoER - A Edge Refined Self-Supervised Monocular Depth Estimation Method

HI-Net: Boosting Self-Supervised Indoor Depth Estimation Via Pose Optimization

Resolution-sensitive self-supervised monocular absolute depth estimation