Abstract: Dense and accurate 3D mapping from a monocular sequence is a key technology for several applications and still an open research area. This paper leverages recent results on single-view CNN-based depth estimation and fuses them with multi-view depth estimation. Both approaches present complementary strengths. Multi-view depth is highly accurate but only in high-texture areas and high-parallax cases. Single-view depth captures the local structure of mid-level regions, including texture-less areas, but the estimated depth lacks global coherence. The single and multi-view fusion we propose is challenging in several aspects. First, both depths are related by a deformation that depends on the image content. Second, the selection of multi-view points of high accuracy might be difficult for low-parallax configurations. We present contributions for both problems. Our results on the public NYUv2 and TUM datasets show that our algorithm outperforms the individual single and multi-view approaches. A video showing the key aspects of mapping in our Single and Multi-view depth proposal is available at https://youtu.be/ipc5HukTb4k.

What problem does this paper attempt to address?

The paper addresses dense and accurate 3D scene reconstruction from monocular image sequences. Specifically, it fuses single-view and multi-view depth estimation to overcome the limitations of each, so that accurate depth estimates are obtained even in low-texture areas and low-parallax configurations.

### Background and Problem Description

- **Single-view depth estimation**: Methods based on deep convolutional neural networks (CNNs) capture local structure, including texture-less regions, but the estimated depth lacks global consistency.
- **Multi-view depth estimation**: Highly accurate in high-texture areas and high-parallax cases, but performs poorly in low-texture areas and low-parallax configurations.

### Main Contributions of the Paper

1. **Fusion of single-view and multi-view depth estimation**
   - **Challenges**:
     - The single-view and multi-view depths are related by a deformation that depends on the image content.
     - Selecting high-accuracy multi-view points is difficult in low-parallax configurations.
   - **Solutions**:
     - A weighted-interpolation method is proposed that fuses the local structure of the single-view depth with the semi-dense multi-view depth, taking into account the quality and area of influence of each multi-view point.
     - Four weight factors model the deformation from local image structure. They account for, among other cues, pixel distance, depth-gradient similarity, and the influence of points lying in the same plane.
2. **Multi-view low-error point selection**
   - In low-parallax geometric configurations, multi-view depth may contain large errors, and the erroneous points need to be filtered out.
   - A two-step algorithm is developed. It first screens points using photometric and geometric information, then screens them further using the single-view depth map. The resulting set of low-error points is used for interpolation.

### Experimental Results

- **Datasets**:
  - NYUv2 Depth Dataset: contains low-parallax and low-texture sequences.
  - TUM RGB-D SLAM Dataset: its sequences favor multi-view depth.
- **Evaluation Metrics**:
  - Root Mean Square Error (RMSE)
  - Mean Absolute Error (MAE)
  - Scale-Invariant Root Mean Square Error (Scale-Invariant RMSE)
- **Performance Comparison**:
  - Compared with multi-view depth estimation alone (TV regularization) and single-view depth estimation alone, the fusion method shows a significant improvement on both datasets.
  - On the NYUv2 dataset, the average improvement is more than 50%, and the result on the TUM dataset is similar.
  - Compared with single-view depth estimation, the improvement of the fusion method is about 10%.

### Conclusion

The paper proposes a method for fusing single-view and multi-view depth estimation. By combining the strengths of the two, it improves depth accuracy in low-texture areas and low-parallax configurations. The experimental results show that the fused estimate outperforms the individual single-view and multi-view approaches on both the NYUv2 and TUM datasets.
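The three evaluation metrics named under Experimental Results can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, depths are plain lists of positive values, and the scale-invariant RMSE is assumed to follow the common log-space definition (the variance of the log-depth differences), which may differ from the exact formula the authors use.

```python
import math


def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth depths."""
    n = len(gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)


def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth depths."""
    n = len(gt)
    return sum(abs(p - g) for p, g in zip(pred, gt)) / n


def scale_invariant_rmse(pred, gt):
    """Scale-invariant RMSE in log space (assumed definition).

    With d_i = log(pred_i) - log(gt_i), the error is
    sqrt(mean(d^2) - mean(d)^2), so a global scale factor on the
    prediction cancels out. Depths must be strictly positive.
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    sq_mean = (sum(d) / n) ** 2
    # Clamp at zero to absorb floating-point round-off.
    return math.sqrt(max(mean_sq - sq_mean, 0.0))


# Toy example: the prediction is the ground truth scaled by 2.
gt = [1.0, 2.0, 4.0]
pred = [2.0, 4.0, 8.0]
print("RMSE:", rmse(pred, gt))
print("MAE:", mae(pred, gt))
print("Scale-invariant RMSE:", scale_invariant_rmse(pred, gt))
```

In the toy example the prediction has the right structure but the wrong global scale, so RMSE and MAE are large while the scale-invariant error is essentially zero. This is the kind of discrepancy the scale-invariant metric is meant to expose when single-view depth has good local structure but lacks global coherence.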

Single-View and Multi-View Depth Fusion
