Abstract: Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments such as autonomous driving scenes. However, depth prediction results remain unsatisfactory in indoor scenes, where most existing data are captured with handheld devices. Compared to outdoor environments, self-supervised depth estimation from monocular video in indoor environments poses two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to learn consistent depth cues during training, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences recorded with handheld devices often contain far more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses, whereas the motion in outdoor sequences is predominantly translational, especially in street-scene driving datasets such as KITTI. In this work, we propose a novel framework, MonoIndoor++, that gives special consideration to these challenges and consolidates a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly; the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly on the inputs to the pose networks.
The proposed method is validated on a variety of indoor benchmark datasets, namely EuRoC MAV, NYUv2, ScanNet and 7-Scenes, where it achieves state-of-the-art performance. In addition, the effectiveness of each module is shown through a carefully conducted ablation study, and the good generalization and universality of our trained model are also demonstrated, specifically on the ScanNet and 7-Scenes datasets.
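
The three components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the coordinate encoding is assumed to be CoordConv-style (normalized x and y channels appended to the input), residual poses are assumed to be 4x4 homogeneous transforms composed by left-multiplication, and depth factorization is assumed to mean a relative depth map multiplied by a single global scale. The learned networks themselves are omitted.

```python
import numpy as np


def add_coord_channels(image):
    """Append normalized x and y coordinate channels to a (C, H, W) array.

    Assumption: CoordConv-style encoding with coordinates in [-1, 1].
    """
    _, h, w = image.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    # yy varies along rows, xx varies along columns.
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([xx, yy], axis=0).astype(image.dtype)
    return np.concatenate([image, coords], axis=0)


def compose_residual_poses(poses):
    """Compose an initial pose and its residual corrections into one transform.

    Assumption: each pose is a 4x4 homogeneous matrix and each new
    residual is applied on the left of the running estimate.
    """
    total = np.eye(4)
    for pose in poses:
        total = pose @ total
    return total


def factorize_depth(relative_depth, scale):
    """Recover depth from a relative depth map and a global scale factor.

    Assumption: depth = scale * relative_depth, with one scalar per image.
    """
    return scale * relative_depth
```

In a full pipeline, each residual pose would be predicted from the target frame and a source frame re-warped with the current pose estimate, and the scale would come from a regression network; here both are simply passed in as inputs.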

Using Full-Scale Feature Fusion for Self-Supervised Indoor Depth Estimation

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

Self-supervised Monocular Depth Estimation with Multi-Scale Feature Fusion

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

Self-supervised Depth Estimation with High Resolution Features and Non-local Information

Deep feature fusion for self-supervised monocular depth prediction

$\mathrm{F^2Depth}$: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis

Super-Resolution for Monocular Depth Estimation with Multi-Scale Sub-Pixel Convolutions and a Smoothness Constraint

Deeper into Self-Supervised Monocular Indoor Depth Estimation

MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Towards Scale-Aware Self-Supervised Multi-Frame Depth Estimation with IMU Motion Dynamics

Resolution-sensitive self-supervised monocular absolute depth estimation

Multi-resolution Monocular Depth Map Fusion by Self-supervised Gradient-based Composition

Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference

Self-supervised Monocular Depth Estimation with Uncertainty-aware Feature Enhancement and Depth Fusion

StructDepth: Leveraging the Structural Regularities for Self-Supervised Indoor Depth Estimation

Self-supervised Monocular Depth Estimation with Self-Distillation and Dense Skip Connection

SIM-MultiDepth: Self-Supervised Indoor Monocular Multi-Frame Depth Estimation Based on Texture-Aware Masking

A Dual Encoder–Decoder Network for Self-Supervised Monocular Depth Estimation

RCFNet: Related Cross-level Feature Network with Cascaded Self-distillation for Monocular Depth Estimation