Abstract: Recent advancements in learning-based Multi-View Stereo (MVS) methods have prominently featured transformer-based models with attention mechanisms. However, existing approaches have not thoroughly investigated the profound influence of transformers on different MVS modules, resulting in limited depth estimation capabilities. In this paper, we introduce MVSFormer++, a method that prudently maximizes the inherent characteristics of attention to enhance various components of the MVS pipeline. Formally, our approach involves infusing cross-view information into the pre-trained DINOv2 model to facilitate MVS learning. Furthermore, we employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregations respectively. Additionally, we uncover that some design details would substantially impact the performance of transformer modules in MVS, including normalized 3D positional encoding, adaptive attention scaling, and the position of layer normalization. Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method. Notably, MVSFormer++ achieves state-of-the-art performance on the challenging DTU and Tanks-and-Temples benchmarks.

What problem does this paper attempt to address?

This paper addresses the limited depth-estimation capability of Transformer-based models in multi-view stereo (MVS) reconstruction. Although existing Transformer-based methods perform well on MVS tasks, the influence of Transformers on the different MVS modules has not been thoroughly studied, which limits depth-estimation quality. To close this gap, the authors propose MVSFormer++, which enhances each component of the MVS pipeline by exploiting the inherent characteristics of the attention mechanism.

### Main Problems and Challenges

1. **Customized Attention Mechanisms**:
   - The MVS learning framework has two main components: the feature encoder and cost-volume regularization. Because their feature properties differ, these two modules should not rely on the same attention mechanism.
2. **Integrating Cross-View Information into Pre-trained ViT**:
   - Although a pre-trained ViT (such as DINOv2) brings significant improvements in MVSFormer, feature interaction between different views still needs to be strengthened. Existing cross-view pre-trained ViTs struggle to fully address multi-view correlation.
3. **Enhancing the Length-Extrapolation Ability of Transformers in MVS**:
   - Image sizes differ significantly between training and testing, and feature matching at higher resolutions is often more accurate. However, enabling Transformers to generalize to different sequence lengths as effectively as convolutional neural networks (CNNs) remains a major challenge.

### Solutions

1. **Side View Attention (SVA)**:
   - To enhance the cross-view learning ability of the DINOv2 model, the authors design the SVA module, which gradually injects cross-view attention. SVA uses linear attention (Katharopoulos et al., 2020), which performs well in the feature-encoding stage.
2. **3D Frustoconical Positional Encoding (FPE)**:
   - FPE provides globally normalized 3D position cues, improving the handling of longer 3D sequences. It is especially suited to cost-volume regularization.
3. **Adaptive Attention Scaling (AAS)**:
   - To counter attention dilution, the authors revisit AAS so that attention scores remain stable across different sequence lengths.
4. **Cost Volume Transformer (CVT)**:
   - CVT aggregates features along the spatial dimension with a simple attention mechanism, significantly improving the final reconstruction quality and reducing outliers.

### Summary

MVSFormer++ addresses key weaknesses of existing Transformer-based MVS methods through customized attention mechanisms, the SVA module, a careful study of Transformer design details, and the FPE and AAS techniques. It achieves state-of-the-art performance on multiple benchmark datasets.
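Two of the ideas named above, linear attention (Katharopoulos et al., 2020) and length-aware scaling of attention logits (the motivation behind AAS), can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names are hypothetical, and the `log(n) / log(train_len)` factor is an assumed, illustrative form of length-aware scaling; the summary above does not give the exact AAS formula.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    # Shannon entropy; higher means attention is spread over more keys.
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def attention_weights(query, keys, train_len=None):
    """Softmax attention weights of one query over all keys.

    If train_len is given, the logits are also multiplied by
    log(n) / log(train_len). This is an ASSUMED illustrative scaling,
    not necessarily the formula used in the paper.
    """
    n, d = len(keys), len(query)
    scale = 1.0 / math.sqrt(d)
    if train_len is not None:
        scale *= math.log(n) / math.log(train_len)
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(logits)


def phi(x):
    # Feature map elu(x) + 1 from Katharopoulos et al. (2020); always positive.
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Linear attention: out_i = phi(q_i)^T S / (phi(q_i)^T z),
    with S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).
    S and z are computed once, so cost grows linearly with sequence length.
    """
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # S
    k_sum = [0.0] * d                     # z
    for key, value in zip(keys, values):
        fk = [phi(x) for x in key]
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * value[b]
    out = []
    for query in queries:
        fq = [phi(x) for x in query]
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    d = 8
    query = [rng.gauss(0, 1) for _ in range(d)]
    for n in (64, 1024):  # "training" length vs. a longer test sequence
        keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        plain = entropy(attention_weights(query, keys))
        scaled = entropy(attention_weights(query, keys, train_len=64))
        print(f"n={n}: entropy plain={plain:.3f}, scaled={scaled:.3f}")
```

With a longer key sequence the unscaled softmax spreads its mass over more keys, which is one way to picture attention dilution. Multiplying the logits by a factor that grows with sequence length sharpens the distribution again, and at `n == train_len` the factor is 1, so behavior at the training length is unchanged.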

MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

MTD-MVSNet: Multi-view Stereo Network with Multi-scale Transformer and Dual Attention

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Feature‐enhanced representation with transformers for multi‐view stereo

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Constituent Attention for Vision Transformers

CT-MVSNet: Curvature-guided multi-view stereo with transformers

Modeling Long-Range Dependencies and Epipolar Geometry for Multi-View Stereo

Transformer-guided Feature Pyramid Network for Multi-View Stereo

CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

Multi-View Stereo Network Based on Attention Mechanism and Neural Volume Rendering

Enhanced Window-Based Self-Attention with Global and Multi-Scale Representations for Remote Sensing Image Super-Resolution

Attention-enhanced multi-source cost volume multi-view stereo

MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

NR-MVSNet: Learning Multi-View Stereo Based on Normal Consistency and Depth Refinement