Single-View and Multi-View Depth Fusion

José M. Fácil, Alejo Concha, Luis Montesano, Javier Civera
DOI: https://doi.org/10.1109/LRA.2017.2715400
2017-06-27
Abstract: Dense and accurate 3D mapping from a monocular sequence is a key technology for several applications and still an open research area. This paper leverages recent results on single-view CNN-based depth estimation and fuses them with multi-view depth estimation. Both approaches present complementary strengths. Multi-view depth is highly accurate but only in high-texture areas and high-parallax cases. Single-view depth captures the local structure of mid-level regions, including texture-less areas, but the estimated depth lacks global coherence. The single and multi-view fusion we propose is challenging in several aspects. First, both depths are related by a deformation that depends on the image content. Second, the selection of multi-view points of high accuracy might be difficult for low-parallax configurations. We present contributions for both problems. Our results on the public NYUv2 and TUM datasets show that our algorithm outperforms the individual single and multi-view approaches. A video showing the key aspects of mapping in our Single and Multi-view depth proposal is available at https://youtu.be/ipc5HukTb4k
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The problem this paper addresses is dense and accurate 3D scene reconstruction from a monocular image sequence. Specifically, it focuses on how to fuse single-view and multi-view depth estimation so that each compensates for the other's limitations, yielding more accurate depth even in low-texture areas and low-parallax configurations.

### Background and Problem Description

- **Single-view depth estimation**: Methods based on deep convolutional neural networks (CNNs) capture local structure, including texture-less regions, but the estimated depth lacks global coherence.
- **Multi-view depth estimation**: Very accurate in high-texture areas and high-parallax cases, but performs poorly in low-texture areas and low-parallax configurations.

### Main Contributions of the Paper

1. **Fusion of single-view and multi-view depth estimation**:
   - **Challenges**:
     - Single-view and multi-view depths are related by a deformation that depends on the image content.
     - Selecting high-accuracy multi-view points is difficult in low-parallax configurations.
   - **Solutions**:
     - A weighted-interpolation method is proposed. It propagates the accurate but semi-dense multi-view depths over their areas of influence while preserving the local structure of the single-view depth.
     - Four weight factors model the deformation based on local image structure. They account for pixel distance, depth-gradient similarity, and the influence of in-plane points, among others.
2. **Multi-view low-error point selection**:
   - In low-parallax geometric configurations, multi-view depth may contain large errors, and the affected points need to be filtered out.
   - A two-step algorithm is developed. It first combines photometric and geometric information for a preliminary screening, and then uses the single-view depth map for further screening. The result is a set of low-error points used for the interpolation.

### Experimental Results

- **Datasets**:
  - NYUv2 Depth Dataset: contains low-parallax and low-texture sequences.
  - TUM RGB-D SLAM Dataset: its sequences favor multi-view depth.
- **Evaluation Metrics**:
  - Root Mean Square Error (RMSE)
  - Mean Absolute Error (MAE)
  - Scale-Invariant Root Mean Square Error (Scale-Invariant RMSE)
- **Performance Comparison**:
  - Compared with using only multi-view depth estimation (TV regularization) or only single-view depth estimation, the fusion method shows a significant improvement on both datasets.
  - On the NYUv2 dataset, the average improvement is more than 50%, with similar results on the TUM dataset.
  - Compared with single-view depth estimation, the improvement of the fusion method is about 10%.

### Conclusion

The paper proposes a method for fusing single-view and multi-view depth estimation. By combining the strengths of the two, it improves depth estimation accuracy in low-texture areas and low-parallax configurations. The experimental results show that the fused depth outperforms the individual single-view and multi-view estimates on both datasets.
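
The three evaluation metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code. The function name `depth_errors` and the validity mask (finite, strictly positive depths) are assumptions. So is the scale-invariant definition: the summary does not give the paper's exact formula, and the sketch uses the common log-space form, the standard deviation of the per-pixel log-depth differences.

```python
import numpy as np

def depth_errors(pred, gt):
    """Compare a predicted depth map against ground truth (hypothetical helper).

    Returns RMSE, MAE and a log-space scale-invariant RMSE computed over
    pixels where both depths are finite and strictly positive.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Assumption: zero, negative or non-finite depth marks a missing measurement.
    valid = np.isfinite(pred) & np.isfinite(gt) & (pred > 0) & (gt > 0)
    p, g = pred[valid], gt[valid]

    diff = p - g
    rmse = np.sqrt(np.mean(diff ** 2))   # root mean square error
    mae = np.mean(np.abs(diff))          # mean absolute error

    # Assumed scale-invariant form: spread of the log-depth differences.
    # A global scale factor shifts every log difference by the same constant,
    # so it cancels when the mean is subtracted.
    d = np.log(p) - np.log(g)
    si_rmse = np.sqrt(np.mean((d - d.mean()) ** 2))

    return {"rmse": rmse, "mae": mae, "si_rmse": si_rmse}
```

Under this definition, a prediction that is exactly twice the ground truth everywhere has non-zero RMSE and MAE but a scale-invariant error of essentially zero. This makes the third metric useful for judging local structure independently of global scale.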