Abstract: Estimating scene depth from a single image is widely applicable to understanding 3D environments, because images captured by consumer-level cameras are easy to obtain. Previous works exploit conditional random fields (CRFs) to estimate image depth, where neighboring pixels (superpixels) with similar appearances are constrained to share the same depth. However, depth may vary significantly across a slanted surface, so this constraint leads to severe estimation errors. To eliminate those errors, we propose a superpixel-based normal guided scale invariant deep convolutional field that encourages neighboring superpixels with similar appearance to lie on the same 3D plane of the scene. To this end, a depth-normal multitask CNN is introduced to produce superpixel-wise depth and surface normal predictions simultaneously. To correct the errors of the roughly estimated superpixel-wise depth, we develop a normal guided scale invariant CRF (NGSI-CRF). NGSI-CRF consists of a scale invariant unary potential, which measures both the relative depth between superpixels and the absolute depth of superpixels, and a normal guided pairwise potential, which constrains spatial relationships between superpixels in accordance with the 3D layout of the scene. In other words, the normal guided pairwise potential is designed to smooth the depth prediction without deteriorating its 3D structure. The superpixel-wise depth maps estimated by NGSI-CRF are fed into a pixel-wise refinement module to produce a smooth, fine-grained depth prediction. Furthermore, we derive a closed-form solution for the maximum a posteriori (MAP) inference of NGSI-CRF. Thus, our proposed network can be efficiently trained in an end-to-end manner. We conduct experiments on several datasets, including NYU-D2, KITTI, and Make3D. As the experimental results demonstrate, our method achieves superior performance in both indoor and outdoor scenes.
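
The geometric idea behind the two potentials can be illustrated with a minimal sketch. It is not the paper's NGSI-CRF formulation, which the abstract does not spell out. It assumes a pinhole camera with a hypothetical intrinsic matrix `K`, z-depth values, and superpixels represented by single centroid pixels. `plane_induced_depth` shows what depth a neighbor would have if it lay on the plane defined by a superpixel's depth and normal, and `scale_invariant_log_error` is the widely used scale invariant log error of Eigen et al., swapped in here as a stand-in for the unary term. All function names are illustrative.

```python
import numpy as np


def backproject_ray(K, uv):
    """Return K^{-1} [u, v, 1]^T, the viewing ray through pixel uv (z = 1)."""
    u, v = uv
    return np.linalg.solve(K, np.array([u, v, 1.0]))


def plane_induced_depth(K, uv_i, depth_i, normal_i, uv_j):
    """Depth at pixel j if it lies on the plane through point i with normal_i.

    Point i is P_i = depth_i * r_i, and the plane is n . P = n . P_i.
    A point on ray r_j at depth d_j satisfies d_j * (n . r_j) = n . P_i.
    """
    r_i = backproject_ray(K, uv_i)
    r_j = backproject_ray(K, uv_j)
    denom = normal_i @ r_j
    if abs(denom) < 1e-8:
        raise ValueError("viewing ray is (nearly) parallel to the plane")
    return depth_i * (normal_i @ r_i) / denom


def scale_invariant_log_error(pred, gt, lam=1.0):
    """Scale invariant log error (Eigen et al.); an assumed stand-in unary.

    With lam = 1 the error ignores a global scale factor on pred;
    with lam = 0 it reduces to the mean squared log error.
    """
    g = np.log(pred) - np.log(gt)
    return np.mean(g ** 2) - lam * np.mean(g) ** 2


# Hypothetical intrinsics for a 640x480 image (assumption, not from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

slanted_normal = np.array([0.6, 0.0, 0.8])
d_j = plane_induced_depth(K, (320, 240), 2.0, slanted_normal, (420, 240))
```

On a fronto-parallel plane the induced depth equals the source depth, which recovers the "same depth" constraint of earlier CRFs. On a slanted plane the two depths differ, which is the case the normal guided pairwise term is meant to handle.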

MMAIndoor: Patched MLP and Multi-dimensional Cross Attention Based Self-supervised Indoor Depth Estimation

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

SIM-MultiDepth: Self-Supervised Indoor Monocular Multi-Frame Depth Estimation Based on Texture-Aware Masking

PMIndoor: Pose Rectified Network and Multiple Loss Functions for Self-Supervised Monocular Indoor Depth Estimation

GAM-Depth: Self-Supervised Indoor Depth Estimation Leveraging a Gradient-Aware Mask and Semantic Constraints

Deeper into Self-Supervised Monocular Indoor Depth Estimation

MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Single Image Depth Estimation with Normal Guided Scale Invariant Deep Convolutional Fields

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

Depth Insight -- Contribution of Different Features to Indoor Single-image Depth Estimation

Multi‐view stereo for weakly textured indoor 3D reconstruction

DCL-depth: monocular depth estimation network based on iam and depth consistency loss

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

A Two-Stage Masked Autoencoder Based Network for Indoor Depth Completion

Bridging the Gap Between Indoor Depth Completion and Masked Autoencoders

Iterative Feature Matching for Self-Supervised Indoor Depth Estimation

Indoor Scene Classification by Incorporating Predicted Depth Descriptor

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

Depth Information Calculation Method for Unstructured Objects Based on Deep Neural Network

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

Indoor Scene Reconstruction From Monocular Video Combining Contextual and Geometric Priors