Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo

Jiayu Yang, Jose M. Alvarez, Miaomiao Liu
DOI: https://doi.org/10.48550/arXiv.2205.03783
2022-05-08
Abstract: Recent cost volume pyramid based deep neural networks have unlocked the potential of efficiently leveraging high-resolution images for depth inference from multi-view stereo. In general, those approaches assume that the depth of each pixel follows a unimodal distribution. Boundary pixels usually follow a multi-modal distribution as they represent different depths; therefore, the assumption results in an erroneous depth prediction at the coarser level of the cost volume pyramid that cannot be corrected in the refinement levels, leading to wrong depth predictions. In contrast, we propose constructing the cost volume by non-parametric depth distribution modeling to handle pixels with unimodal and multi-modal distributions. Our approach outputs multiple depth hypotheses at the coarser level to avoid errors in the early stage. As we perform local search around these multiple hypotheses in subsequent levels, our approach does not maintain the rigid depth spatial ordering and, therefore, we introduce a sparse cost aggregation network to derive information within each volume. We evaluate our approach extensively on two benchmark datasets: DTU and Tanks & Temples. Our experimental results show that our model outperforms existing methods by a large margin and achieves superior performance on boundary regions. Code is available at https://github.com/NVlabs/NP-CVP-MVSNet
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses inaccurate depth estimation in boundary regions in Multi-view Stereo (MVS). Existing deep neural network methods based on cost volume pyramids typically assume that the depth of each pixel follows a unimodal distribution. However, boundary pixels often follow a multimodal distribution because they cover surfaces at different depths. The unimodal assumption leads to erroneous depth predictions at the coarse level of the cost volume pyramid, and these errors cannot be corrected in subsequent refinement levels, resulting in inaccurate final depth predictions.

### Solution

To solve this problem, the authors propose constructing the cost volume with non-parametric depth distribution modeling. The specific contributions are as follows:

1. **Non-parametric Depth Probability Distribution Modeling**: Handles pixels with both unimodal and multimodal distributions.
2. **Cost Volume Pyramid Construction**: Constructs the cost volume pyramid from the modeled pixel-level depth probability distribution by branching depth samples.
3. **Sparse Cost Aggregation Network**: Processes each cost volume with sparse convolutions. Because local search around multiple hypotheses breaks the rigid depth ordering of the samples, sparse aggregation is used to keep the geometric spatial relationships within the volume and avoid spatial blurring.
4. **Performance Improvement**: Experimental results show that this method outperforms existing methods in depth estimation in boundary regions, especially on the DTU dataset.

### Method Overview

1. **Non-parametric Depth Distribution Modeling**:
   - Assumes that the depth \( d \) of each pixel \( p \) follows a continuous probability distribution \( P_p(d) \).
   - Approximates this continuous distribution through discrete depth hypotheses \( \{d_{p,m}\}_{m=1}^{M_l} \), where \( M_l \) is the number of hypotheses at pyramid level \( l \).
   - Uses a histogram constructed from high-resolution depth map observations as the ground truth probability distribution.
2. **Cost Volume Pyramid Construction**:
   - Constructs the initial cost volume \( C_L \) at the coarsest level \( L \).
   - Uses a 3D-UNet for cost aggregation, outputting the probability volume \( P_L \).
   - Selects the top \( K \) depth hypotheses with the highest probabilities to construct the cost volume for the next level.
3. **Sparse Cost Aggregation Network**:
   - Uses a sparse convolution structure to aggregate information within each volume while keeping its geometric spatial relationships.
   - The basic block of the network includes three layers of sparse 3D convolution, sparse batch normalization, and sparse ReLU activation functions.
4. **Full-resolution Depth Inference**:
   - Performs depth inference at the highest resolution level \( 0 \), approximating the depth of each pixel as the expected value of its estimated distribution.

### Experimental Results

1. **DTU Dataset**:
   - Quantitative Results: This method outperforms all existing methods in terms of average completeness and overall score.
   - Boundary Region Performance: Achieves the lowest average depth error in the boundary region (R0).
2. **Tanks and Temples Dataset**:
   - Generalization Test: Without fine-tuning, the model shows competitive performance on the Tanks and Temples dataset, particularly excelling in depth estimation in boundary regions.

### Summary

By introducing non-parametric depth distribution modeling and a sparse cost aggregation network, this paper addresses inaccurate depth estimation in boundary regions in multi-view stereo, improving the accuracy and robustness of depth estimation.
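The steps in the Method Overview (histogram target, top-\( K \) selection, local search, and expectation-based depth) can be illustrated for a single boundary pixel with a small NumPy sketch. This is not the authors' implementation: the function names, the nearest-hypothesis binning rule, the search half-width, and the toy depth values are all assumptions made for illustration.

```python
import numpy as np


def histogram_target(observations, hypotheses):
    """Histogram of high-resolution depth observations over discrete hypotheses.

    Assumption: each observation is assigned to its nearest hypothesis.
    """
    observations = np.asarray(observations, dtype=float)
    hypotheses = np.asarray(hypotheses, dtype=float)
    nearest = np.abs(observations[:, None] - hypotheses[None, :]).argmin(axis=1)
    counts = np.bincount(nearest, minlength=hypotheses.size)
    return counts / counts.sum()


def top_k_hypotheses(prob, hypotheses, k):
    """Keep the k hypotheses with the highest probability, most probable first."""
    idx = np.argsort(prob)[::-1][:k]
    return np.asarray(hypotheses, dtype=float)[idx]


def branch(selected, half_width, n_per_branch):
    """Local search: sample new hypotheses around every kept hypothesis.

    The concatenated result is grouped per branch, so it is not guaranteed
    to form one regularly spaced, ordered depth axis.
    """
    offsets = np.linspace(-half_width, half_width, n_per_branch)
    return (np.asarray(selected, dtype=float)[:, None] + offsets[None, :]).ravel()


def expected_depth(prob, hypotheses):
    """Depth as the expected value of the estimated distribution."""
    prob = np.asarray(prob, dtype=float)
    return float((prob / prob.sum() * np.asarray(hypotheses, dtype=float)).sum())


# Toy boundary pixel: a coarse pixel covering a foreground surface near
# depth 2 and a background surface near depth 7 (hypothetical values).
hypotheses = np.arange(1.0, 9.0)                       # 1, 2, ..., 8
observations = [2.0, 2.1, 1.9, 2.0, 2.0, 7.0, 7.1, 6.9]

target = histogram_target(observations, hypotheses)    # two modes: 0.625 and 0.375
kept = top_k_hypotheses(target, hypotheses, k=2)       # both modes survive
next_level = branch(kept, half_width=0.5, n_per_branch=3)

print("target distribution:", target)
print("kept hypotheses:", kept)
print("next-level hypotheses:", next_level)
# Taking the expectation at the coarse level lands between the two surfaces,
# which is the failure mode that keeping several hypotheses avoids.
print("coarse-level expectation:", expected_depth(target, hypotheses))  # 3.875
```

In this toy case a single expected value at the coarse level (3.875) lies on neither surface, whereas keeping the top two hypotheses retains both candidate depths for refinement. Taking the expectation only at the final level, once the distribution has been narrowed, matches the full-resolution inference step described above.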