Abstract: Objective Obtaining scene depth is crucial for 3D reconstruction, autonomous driving, and related tasks. Methods based on lidar or time-of-flight (ToF) cameras are not widely applicable because of their high cost. Inferring scene depth from a single RGB image is more cost-effective and therefore suits a broader range of applications. Inspired by the recent success of deep learning on various ill-posed problems, many researchers have adopted convolutional neural networks to estimate reasonable and accurate monocular depth. However, most existing deep learning studies focus on enhancing the feature extraction capability of the network and pay little attention to the distribution of depth values within an image. Estimating the per-image distribution of pixel depths can both improve inference precision and make the reconstructed 3D scene more consistent with the ground truth. We therefore propose a new adaptive depth distribution module, which allows the model to predict a different depth distribution for each image during training.
Methods The NYU Depth-v2 dataset created by New York University is employed. Our model is built on an encoder-decoder structure with skip connections, which has been shown to guide image generation more effectively. An indirect representation of depth maps based on plane coefficients is introduced to implicitly add a plane constraint to depth estimation and to obtain smoother depth estimates in the planar regions of the scene. Specifically, two sub-networks with different lightweight designs are adopted at the bottleneck and at the other upsampling stages of the network to enhance the model's feature extraction capability.
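To make the plane coefficient idea concrete, the sketch below recovers the depth of one pixel from a plane given in the camera frame. It is only an illustration under stated assumptions: the abstract does not specify the authors' parameterization, so the plane form n · P = d, the pinhole intrinsics (fx, fy, cx, cy), and the function name depth_from_plane are all hypothetical choices made here.

```python
def depth_from_plane(u, v, normal, offset, fx, fy, cx, cy):
    """Depth Z at pixel (u, v) for the plane normal . P = offset.

    Assumes a pinhole camera, so a 3D point is P = Z * ray, where
    ray = ((u - cx) / fx, (v - cy) / fy, 1). Substituting into the
    plane equation gives Z = offset / (normal . ray).
    All names and the parameterization are illustrative assumptions.
    """
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-9:
        raise ValueError("viewing ray is parallel to the plane")
    return offset / denom


# Hypothetical intrinsics, chosen only to keep the arithmetic simple.
FX, FY, CX, CY = 100.0, 100.0, 50.0, 50.0

# A fronto-parallel wall 2 m away has the same depth at every pixel.
wall = depth_from_plane(10, 80, (0.0, 0.0, 1.0), 2.0, FX, FY, CX, CY)

# A tilted plane x + z = 3 seen through the pixel whose ray is (0.5, 0, 1).
tilted = depth_from_plane(100, 50, (1.0, 0.0, 1.0), 3.0, FX, FY, CX, CY)
```

Because every pixel assigned to the same plane shares one set of coefficients, depth varies smoothly across that region by construction, which is the kind of implicit plane constraint the Methods text describes.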
In addition, an adaptive depth distribution estimation module is designed to estimate a different depth distribution for each input image, which brings the pixel distribution of the predicted depth maps closer to the ground truth. A two-stage training strategy is employed. In the first stage, we load weights pretrained on ImageNet into the backbone network and optimize the model using only the 2D-level loss function. In the second stage, we perform joint training with loss functions at both the 2D and 3D levels.
Results and Discussions Our study employs multiple metrics, including root mean square error (RMSE), relative error (REL), and intersection over union (IoU), to quantitatively evaluate the inference ability of the proposed model. As shown in Table 1, the proposed lightweight network outperforms most of the listed methods with only 46 M parameters, which indicates that the overall structure of the model is concise and effective. The visual comparison of 3D depth reconstruction (Fig. 5) demonstrates that the proposed network outputs smoother and more continuous depth predictions in planar regions, and reasonable predictions in partially occluded or missing areas of planar regions. In terms of depth distribution, the adaptive depth distribution module makes the predicted distribution follow the trend of the ground-truth curve more closely and achieves a higher IoU than other methods (Fig. 6 and Table 3), indicating the effectiveness of the proposed module. Additionally, the lightweight network balances accuracy and speed in real-time scenarios (Table 2) and yields good inference and reconstruction results. However, the proposed network has some limitations in recovering fine details of the depth predictions (Fig. 7). Designing the network to recover more depth detail while preserving real-time prediction performance will be the focus of our future work.
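The metrics named above can be sketched as follows. RMSE and REL use their standard definitions over valid pixels. The histogram-based IoU is an assumption: the abstract does not define how the IoU between depth distributions is computed, so the bin count, depth range, and helper names here are hypothetical.

```python
import math


def rmse(pred, gt):
    """Root mean square error over paired depth values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def rel(pred, gt):
    """Mean absolute relative error; ground-truth depths must be positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def depth_histogram(depths, num_bins, d_min, d_max):
    """Count depths per equal-width bin, clamping out-of-range values."""
    counts = [0] * num_bins
    width = (d_max - d_min) / num_bins
    for d in depths:
        idx = int((d - d_min) / width)
        counts[max(0, min(idx, num_bins - 1))] += 1
    return counts


def histogram_iou(pred, gt, num_bins=10, d_min=0.0, d_max=10.0):
    """Assumed definition: sum of bin-wise minima over sum of bin-wise maxima."""
    hp = depth_histogram(pred, num_bins, d_min, d_max)
    hg = depth_histogram(gt, num_bins, d_min, d_max)
    inter = sum(min(a, b) for a, b in zip(hp, hg))
    union = sum(max(a, b) for a, b in zip(hp, hg))
    return inter / union if union else 1.0


# Toy flattened depth maps in metres (made-up values, not from the paper).
pred = [1.0, 2.0, 4.0]
gt = [1.0, 2.0, 2.0]
scores = (rmse(pred, gt), rel(pred, gt), histogram_iou(pred, gt))
```

Under this definition, an IoU of 1 means the predicted and ground-truth depth histograms coincide bin for bin, regardless of which pixels carry which depth, so it complements the per-pixel errors rather than replacing them.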
Conclusions A model based on plane coefficient representation with adaptive depth distribution is presented for monocular depth estimation. Qualitative and quantitative results on the NYU Depth-v2 dataset, together with multiple comparative experiments, demonstrate that the proposed method obtains reasonable predictions for planar regions in images with partial occlusions or small viewing angles. Additionally, the proposed depth distribution prediction module optimizes the pixel depth distribution separately for each image, which brings the predicted depth distribution closer to that of the real scene. With its lightweight design, the method balances inference speed and inference accuracy and is well suited to practical scenarios that require accuracy in real time, such as indoor virtual reality and human-computer interaction.
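One common way to give each image its own depth distribution is adaptive binning: the network predicts per-image bin widths and per-pixel probabilities over those bins. The sketch below illustrates that general technique only. The abstract does not describe the internals of the proposed module, so this is not the authors' implementation, and the function names and input shapes are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_bin_centers(width_logits, d_min, d_max):
    """Turn per-image logits into bin centers that tile [d_min, d_max].

    The softmax makes the widths positive and sum to 1, so the bins
    partition the depth range differently for each image.
    """
    centers = []
    left = d_min
    for w in softmax(width_logits):
        span = w * (d_max - d_min)
        centers.append(left + span / 2.0)
        left += span
    return centers


def pixel_depth(bin_logits, centers):
    """Depth of one pixel as the probability-weighted mean of bin centers."""
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, centers))


# Two equal-width bins on an assumed 0 to 10 m range.
centers = adaptive_bin_centers([0.0, 0.0], 0.0, 10.0)
# A pixel that is undecided between the two bins lands at the midpoint.
depth = pixel_depth([0.0, 0.0], centers)
```

In a scheme like this, an image dominated by nearby surfaces can receive narrow bins at small depths, which concentrates resolution where its pixels actually lie.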

Monocular Depth Estimation Method Based on Plane Coefficient Representation with Adaptive Depth Distribution
