A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

Yitong Dong, Yijin Li, Zhaoyang Huang, Weikang Bian, Jingbo Liu, Hujun Bao, Zhaopeng Cui, Hongsheng Li, Guofeng Zhang
2024-11-04
Abstract: In this paper, we propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method simultaneously considers all the source images. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embedding to encapsulate information such as multi-view camera poses, providing implicit geometric constraints for multi-view disparity feature fusion dominated by attention. Additionally, we construct corresponding hidden states for each source image due to significant differences in the observation quality of the same pixel in the reference frame across multiple source frames. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image and dynamically update hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and Tanks&Temple benchmark demonstrate the effectiveness of our method. The code is available at our project page: https://zju3dv.github.io/GD-PoseMVS/
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses a key problem in multi-view stereo (MVS): **removing the dependence on the depth-range prior**. Existing MVS methods usually require a suitable depth range to be set in advance for each real-world scene. Choosing that range is difficult in practice, and the results are sensitive to it: a poorly chosen depth range can degrade performance significantly.

#### Limitations of existing methods

1. **Traditional methods**:
   - Rely on hand-designed similarity measures and regularization to compute dense correspondences between input images.
   - Tend to degrade in challenging scenes, for example under illumination changes and in texture-less or occluded regions.
2. **Learning-based methods**:
   - Use convolutional neural networks (CNNs) and Transformers to learn discriminative features directly from the input images.
   - Sample depth hypotheses within a given depth range, warp the source-image features into the reference view (the plane-sweep algorithm), then build a cost volume and regularize it to obtain the final depth map.
   - Are very sensitive to the depth range, which limits their applicability.
3. **Recent methods without a depth-range prior**:
   - Recast depth regression as a matching problem along the epipolar line and process the source images in a pairwise manner.
   - Ignore the cross-image correspondences between source images, which may lead to sub-optimal solutions.
   - Reduce the influence of the depth prior, but their initialization still depends on the depth range, so performance degrades significantly when the depth-range error is large.
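The depth-range sensitivity described above can be made concrete with a small sketch. Each depth hypothesis for a reference pixel corresponds to one point on the epipolar line in a source image, so a depth range amounts to a segment of that line. The code below is only an illustration under assumed values: the pinhole intrinsics (`F`, `CX`, `CY`), the relative pose (`R_IDENTITY`, `T_SIDEWAYS`), and all function names are hypothetical and are not taken from the paper or its code.

```python
def back_project(pixel, depth, f, cx, cy):
    """Lift a reference pixel to a 3D point in the reference camera frame."""
    u, v = pixel
    return ((u - cx) / f * depth, (v - cy) / f * depth, depth)


def to_source_frame(point, R, t):
    """Apply an assumed known relative pose: X_src = R * X_ref + t."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


def project(point, f, cx, cy):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def epipolar_samples(pixel, depths, f, cx, cy, R, t):
    """Map each depth hypothesis to its location in the source image."""
    return [
        project(to_source_frame(back_project(pixel, d, f, cx, cy), R, t), f, cx, cy)
        for d in depths
    ]


# Hypothetical camera: 500 px focal length, principal point (320, 240),
# shared by both views; the source view is shifted sideways by 0.1 units.
F, CX, CY = 500.0, 320.0, 240.0
R_IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_SIDEWAYS = (-0.1, 0.0, 0.0)

ref_pixel = (320.0, 240.0)
hypotheses = [1.0, 2.0, 5.0]  # depth hypotheses inside an assumed range [1, 5]
samples = epipolar_samples(ref_pixel, hypotheses, F, CX, CY, R_IDENTITY, T_SIDEWAYS)

# Suppose the true depth is 0.5, outside the assumed range.
true_match = epipolar_samples(ref_pixel, [0.5], F, CX, CY, R_IDENTITY, T_SIDEWAYS)[0]
sampled_us = [u for u, _ in samples]
range_covers_match = min(sampled_us) <= true_match[0] <= max(sampled_us)
```

With these assumed values, the hypotheses land at u = 270, 295 and 310 on the same image row, while the true match sits at u = 220. No hypothesis inside the chosen range can reach it, which is the failure mode that motivates matching directly along the epipolar line.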
### Solutions proposed in the paper

To overcome these problems, the paper proposes a new multi-view stereo framework, the **Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding**. The main contributions are:

1. **Consider all source images simultaneously**:
   - Unlike existing pairwise methods, the new method considers all source images at once and so makes fuller use of multi-view information.
2. **Introduce the Multi-view Disparity Attention (MDA) module**:
   - Improves feature fusion by aggregating long-range context information within and across multi-view images.
3. **Model geometric constraints**:
   - Introduces pose embedding to encapsulate multi-view camera pose information, providing implicit geometric constraints for the attention-based fusion of multi-view disparity features.
4. **Dynamically update the hidden state**:
   - Constructs a hidden state for each source image, explicitly estimates the quality of the current pixel's sampled points on the epipolar line of that source image, and dynamically updates the hidden state through the uncertainty estimation module. This accounts for differences in observation quality across source images.
5. **Depth-range-free initialization**:
   - Designs a new initialization method that further removes the influence of the depth-range prior and improves robustness.

### Summary

The core objective of this paper is a multi-view stereo method that does not depend on the depth-range prior, making it more robust and reliable in real-world applications. It pursues this through the Multi-view Disparity Attention module, pose embedding, and dynamically updated per-source hidden states, with the aim of better performance in complex scenes.
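To illustrate why per-source observation quality matters (contribution 4), here is a minimal sketch of uncertainty-weighted fusion across source views. It is not the paper's uncertainty estimation module, which updates learned hidden states. It substitutes a plain softmax over negative uncertainty, and the function names and numbers are assumptions chosen for illustration.

```python
import math


def confidence_weights(uncertainties):
    """Softmax over negative uncertainty: lower uncertainty, higher weight."""
    lowest = min(uncertainties)  # subtract the minimum for numerical stability
    scores = [math.exp(-(u - lowest)) for u in uncertainties]
    total = sum(scores)
    return [s / total for s in scores]


def fuse_estimates(estimates, uncertainties):
    """Weighted average of per-source estimates for one reference pixel."""
    weights = confidence_weights(uncertainties)
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical per-source depth estimates for a single reference pixel.
# The third source view is assumed to be occluded, so its estimate is an
# outlier and its uncertainty is high.
per_view_depth = [2.0, 2.1, 5.0]
per_view_uncertainty = [0.1, 0.2, 8.0]

weights = confidence_weights(per_view_uncertainty)
fused = fuse_estimates(per_view_depth, per_view_uncertainty)
naive_mean = sum(per_view_depth) / len(per_view_depth)
```

In this toy setup the occluded view receives a weight close to zero, so the fused value stays near the two consistent estimates, whereas the unweighted mean is pulled toward the outlier.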