Abstract: In this paper, we propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method simultaneously considers all the source images. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embedding to encapsulate information such as multi-view camera poses, providing implicit geometric constraints for multi-view disparity feature fusion dominated by attention. Additionally, because the observation quality of the same reference-frame pixel differs significantly across source frames, we construct a corresponding hidden state for each source image. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image and dynamically update hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and the Tanks and Temples benchmark demonstrate the effectiveness of our method. The code is available at our project page: https://zju3dv.github.io/GD-PoseMVS/

### What problem does this paper attempt to solve?

This paper addresses a key problem in Multi-View Stereo (MVS): **removing the dependence on the depth-range prior**. Existing MVS methods usually need a suitable depth range to be set in advance when dealing with real-world scenes. Choosing that range is difficult in practice, and the results are sensitive to it: if the depth range is poorly chosen, performance can degrade significantly.

#### Limitations of existing methods

1. **Traditional methods**:
   - Rely on hand-designed similarity measures and regularization to compute dense correspondences between input images.
   - Are prone to degradation in complex scenes, such as those with illumination changes, texture-less areas, and occluded areas.
2. **Learning-based methods**:
   - Use Convolutional Neural Networks (CNNs) and Transformers to learn discriminative features directly from input images.
   - Sample depth hypotheses within a given depth range, warp the source-image features into the reference view (i.e., the plane-sweep algorithm), then build a cost volume and regularize it to obtain the final depth map.
   - Are very sensitive to the depth range, which limits their wide application.
3. **Recent methods without a depth-range prior**:
   - Transform the regression problem in depth space into a matching problem along the epipolar line, and process the source images in a pairwise manner.
   - Ignore the cross-image correspondences between source images, which may lead to sub-optimal solutions.
   - Reduce the influence of the depth prior, but their initialization still depends on the depth range, so performance degrades significantly when the depth-range error is large.
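The link between depth hypotheses and epipolar-line matching described above can be illustrated with a small, self-contained sketch. It is not taken from the paper: the function names, the camera intrinsics, and the 0.2-unit horizontal baseline are all hypothetical values chosen for illustration. Each depth hypothesis for one reference pixel is back-projected and then projected into a source view, showing that the candidates all fall on a single line in the source image.

```python
# Illustrative sketch only; all names and camera values are hypothetical.

def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def invert_intrinsics(K):
    """Closed-form inverse of K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def project_to_source(pixel, depth, K_ref, K_src, R, t):
    """Back-project a reference pixel at `depth`, then project it into the source view.

    Assumes the convention X_src = R @ X_ref + t.
    """
    ray = matvec(invert_intrinsics(K_ref), [pixel[0], pixel[1], 1.0])
    X_ref = [depth * r for r in ray]
    X_src = [a + b for a, b in zip(matvec(R, X_ref), t)]
    u, v, w = matvec(K_src, X_src)
    return (u / w, v / w)

def depth_hypotheses(d_min, d_max, n):
    """Uniformly spaced depth samples in [d_min, d_max], as in plane sweeping."""
    step = (d_max - d_min) / (n - 1)
    return [d_min + i * step for i in range(n)]

K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [-0.2, 0.0, 0.0]                                     # pure horizontal baseline

pixel = (400.0, 260.0)
depths = depth_hypotheses(1.0, 5.0, 5)
samples = [project_to_source(pixel, d, K, K, R, t) for d in depths]

for d, (u, v) in zip(depths, samples):
    print(f"depth {d:.1f} -> source pixel ({u:.2f}, {v:.2f})")
```

With this rectified setup, the projected column is analytically `400 - 100 / depth` while the row stays at 260, so the candidates slide along a horizontal epipolar line. A surface whose true depth lies outside the sampled range `[d_min, d_max]` has no nearby candidate at all, which is one way to see why range-based sampling is sensitive to the chosen prior.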
### Solutions proposed in the paper

To overcome the above problems, this paper proposes a new multi-view stereo framework, called **Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding**. The main contributions include:

1. **Consider all source images simultaneously**:
   - Unlike existing pairwise methods, the new method considers all source images at once, making more comprehensive use of multi-view information.
2. **Introduce the Multi-View Disparity Attention (MDA) module**:
   - Improves feature fusion by aggregating long-range context information within and across multi-view images.
3. **Model geometric constraints**:
   - Introduces pose embedding to encapsulate multi-view camera pose information, providing implicit geometric constraints for the attention-based fusion of multi-view disparity features.
4. **Dynamically update the hidden state**:
   - Constructs a hidden state for each source image, explicitly estimates the quality of the current pixel's sampled points on the epipolar line of the source image, and dynamically updates the hidden state through the uncertainty estimation module to adapt to differences in observation quality between source images.
5. **Depth-range-free initialization**:
   - Designs a new initialization method to further eliminate the influence of the depth-range prior and improve robustness.

### Summary

The core objective of this paper is to develop a multi-view stereo matching method that does not depend on the depth-range prior, so that it is more robust and reliable in real-world applications. By introducing the multi-view disparity attention module and pose embedding, and by dynamically updating the hidden state, the method aims to achieve better performance in complex scenes.
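The motivation behind contribution 4, that the same reference pixel is observed with different quality in different source images, can be made concrete with a deliberately simplified sketch. This is not the paper's learned uncertainty module or its hidden-state update; it is a hand-written stand-in that assumes each source view already supplies a depth estimate and a scalar uncertainty, and it fuses them with softmax weights so that unreliable views contribute less. The function names and the numbers are hypothetical.

```python
import math

# Simplified stand-in, not the paper's network: per-view depth estimates and
# uncertainties are assumed to be given, and all values are hypothetical.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_depths(depths, uncertainties):
    """Fuse per-source-view depth estimates, down-weighting uncertain views.

    Weights are softmax(-uncertainty), so a larger uncertainty yields a
    smaller weight. Returns the fused depth and the weights used.
    """
    weights = softmax([-u for u in uncertainties])
    fused = sum(w * d for w, d in zip(weights, depths))
    return fused, weights

# Three source views observe the same reference pixel; the third is
# (hypothetically) occluded and reports an outlier with high uncertainty.
depths = [2.00, 2.04, 3.50]
uncertainties = [0.1, 0.2, 3.0]

fused, weights = fuse_depths(depths, uncertainties)
plain_mean = sum(depths) / len(depths)

print("weights:", [round(w, 3) for w in weights])
print("fused depth:", round(fused, 3), "plain mean:", round(plain_mean, 3))
```

In this toy case the outlier view receives a small weight, so the fused value stays close to the two consistent estimates, whereas the unweighted mean is pulled toward the outlier. A pairwise pipeline that treats every source view equally has no comparable mechanism to discount a poorly observed view.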

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding
