Monocular Depth Estimation Method with Adaptive Depth Distribution Based on Plane Coefficient Representation
王家骏 Wang Jiajun,刘越 Liu Yue,吴宇晖 Wu Yuhui,沙浩 Sha Hao,王涌天 Wang Yongtian
DOI: https://doi.org/10.3788/aos230468
2023-01-01
Abstract:
Objective Obtaining scene depth is crucial in 3D reconstruction, autonomous driving, and other related tasks. Current methods based on LiDAR or time-of-flight (ToF) cameras are not widely applicable because of their high cost. In contrast, inferring scene depth from a single RGB image is more cost-effective and has broader application potential. Inspired by the recent success of deep learning in various ill-posed problems, many researchers use convolutional neural networks to estimate reasonable and accurate monocular depth. However, most existing deep-learning studies focus on enhancing the feature extraction capability of the network and pay little attention to the distribution of image depth. Estimating the pixel depth distribution of an image can not only improve inference precision but also make the reconstructed 3D images more consistent with the ground truth. Therefore, this paper proposes a new adaptive depth distribution module, which allows the model to predict a different depth distribution for each image during training.
Methods The NYU-Depth v2 dataset created by New York University was used in this work. Overall, our model was built on an encoder-decoder structure with skip connections, which has been shown to guide image generation more effectively. An indirect representation of depth maps based on plane coefficients was also introduced to implicitly add a plane constraint to the depth estimation process and to obtain smoother depth estimates in the planar regions of the scene. Specifically, two sub-networks with different lightweight designs were used at the bottleneck and at the other upsampling stages of the network to enhance the model’s feature extraction capability.
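To make the plane-coefficient idea concrete, the following is a minimal sketch of how a depth value can be recovered from plane coefficients. It assumes a pinhole camera and a plane written as n · X = d in camera coordinates; the abstract does not state the exact parameterization used in the paper, so the function name `depth_from_plane`, the coefficient layout `(nx, ny, nz, d)`, and the intrinsics in the example are all hypothetical.

```python
def depth_from_plane(u, v, plane, fx, fy, cx, cy, eps=1e-8):
    """Depth Z at pixel (u, v) for a plane n . X = d in camera coordinates.

    Assumed parameterization (not taken from the paper):
    plane = (nx, ny, nz, d). A pinhole camera back-projects the pixel to
    X = Z * ((u - cx) / fx, (v - cy) / fy, 1), hence Z = d / (n . ray).
    """
    nx, ny, nz, d = plane
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = nx * ray[0] + ny * ray[1] + nz * ray[2]
    if abs(denom) < eps:
        # The viewing ray never meets the plane.
        raise ValueError("viewing ray is parallel to the plane")
    return d / denom


# Hypothetical intrinsics, chosen only for illustration.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# A wall facing the camera at 2 m: every pixel has the same depth.
wall_depth = depth_from_plane(10, 20, (0.0, 0.0, 1.0, 2.0), FX, FY, CX, CY)

# A floor 1.5 m below the camera (y axis pointing down): depth varies
# smoothly with the image row, which is the smoothness a plane constraint
# encourages.
floor_depth = depth_from_plane(320, 490, (0.0, 1.0, 0.0, 1.5), FX, FY, CX, CY)
```

Because all pixels that share the same coefficients lie on one plane by construction, predicting coefficients instead of raw depth builds the planar constraint into the representation rather than adding it as a separate penalty.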
Apart from that, an adaptive depth distribution estimation module was also designed, which estimates a different depth distribution for each input image, so that the network can better predict the relative positions of indoor objects and bring the pixel distribution of the depth maps closer to the ground truth. We employed a two-stage training strategy. In the first stage, we loaded ImageNet-pretrained weights into the backbone network and optimized the model using only the 2D-level loss function. In the second stage, we performed joint training using loss functions at both the 2D and 3D levels. Root mean square error (RMSE), relative error (REL), intersection over union (IoU), and other metrics were used during evaluation to verify the qualitative and quantitative results of the model.
Results and Discussions This study employed multiple metrics, including RMSE, REL, and IoU, to quantitatively evaluate the inference ability of the proposed model. As shown in Table 1, the lightweight network model proposed in this study outperforms most of the listed methods with only 46M parameters, which shows that the overall structure of the model is concise and effective. The visual comparison of 3D depth reconstruction (Fig. 5) demonstrates that the proposed network outputs smoother and more continuous depth predictions in planar regions, as well as reasonable predictions in partially occluded or missing areas of planar regions. In terms of the depth distribution, the adaptive depth distribution module makes the trend of the predicted distribution curve fit the ground truth better and achieves a higher IoU than other methods (Fig. 6, Table 3), which demonstrates the effectiveness of the proposed module. Furthermore, the lightweight network balances accuracy and speed in real-time scenarios (Table 2) and achieves good inference and reconstruction results.
Nevertheless, the proposed network has some limitations in recovering fine details in its depth predictions (Fig. 7); designing the network to recover more depth detail while preserving the model’s real-time prediction performance will be the focus of our future work.
Conclusions This article presents an innovative model based on plane coefficient representation with an adaptive depth distribution predictor for monocular depth estimation. Qualitative and quantitative results on the NYU-Depth v2 dataset and multiple comparative experiments demonstrate that the proposed method obtains reasonable predictions for planar regions in images with partial occlusions or small viewing angles. Additionally, the depth distribution prediction module proposed in this article provides image-specific optimization of the pixel distribution, which brings the predicted pixel depth distribution closer to that of the real image. With its lightweight design, the method balances inference speed and accuracy, making it well suited to practical scenarios that require both real-time performance and accuracy, such as indoor virtual reality and human-computer interaction.
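The evaluation metrics named in the abstract can be sketched as follows. RMSE and REL follow their standard definitions. The abstract does not define how IoU is computed on depth distributions, so the histogram-overlap form below (sum of per-bin minima over sum of per-bin maxima) is an assumption, as are the function names, the bin count, and the 0 to 10 m depth range. Ground-truth depths are assumed to be strictly positive, that is, invalid pixels have already been masked out.

```python
import math


def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def rel(pred, gt):
    """Mean absolute relative error; assumes every gt value is > 0."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def histogram(depths, num_bins, d_min, d_max):
    """Count depths per equal-width bin, clamping out-of-range values."""
    counts = [0] * num_bins
    width = (d_max - d_min) / num_bins
    for d in depths:
        idx = int((d - d_min) / width)
        counts[min(max(idx, 0), num_bins - 1)] += 1
    return counts


def distribution_iou(pred, gt, num_bins=10, d_min=0.0, d_max=10.0):
    """Assumed definition: overlap of the two depth histograms."""
    hp = histogram(pred, num_bins, d_min, d_max)
    hg = histogram(gt, num_bins, d_min, d_max)
    inter = sum(min(a, b) for a, b in zip(hp, hg))
    union = sum(max(a, b) for a, b in zip(hp, hg))
    return inter / union if union else 0.0


# Toy flattened depth maps in metres (illustrative values, not paper data).
pred = [1.0, 2.0, 4.0, 8.0]
gt = [1.0, 2.0, 2.0, 8.0]
scores = (rmse(pred, gt), rel(pred, gt), distribution_iou(pred, gt))
```

RMSE and REL compare depths pixel by pixel, whereas the histogram overlap ignores pixel positions and only asks whether the predicted depths are spread across the depth range in the same proportions as the ground truth, which is the property the adaptive depth distribution module targets.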