Abstract: Multi-view stereo (MVS) reconstruction is a key task in image-based 3D reconstruction, and deep learning-based methods can achieve better results than traditional algorithms. However, most current deep learning-based MVS methods use convolutional neural networks (CNNs) to extract image features, which cannot aggregate long-range context information or capture robust global information. In addition, when depth maps are fused into point clouds, confidence filtering discards low-confidence depth values in weak-texture areas. Together, these problems lead to low completeness of 3D reconstruction in weak-texture and texture-less areas. To address them, this paper proposes SA-MVSNet, which builds on PatchmatchNet and adds a self-attention mechanism. First, we design a coarse-to-fine network framework for depth map estimation. In the feature extraction network, a pyramid-structured module based on the Swin Transformer Block replaces the original Feature Pyramid Network (FPN), and a global self-attention mechanism enhances the self-correlation between weak-texture areas. Second, we propose a self-attention-based adaptive propagation module (SA-AP), which applies self-attention within the depth value propagation window to obtain the relative weights between the current pixel and the other pixels in the window, and then adaptively samples the depth values of neighbors on the same surface for propagation. Experiments show that SA-MVSNet significantly improves the completeness of 3D reconstruction for images with weak texture on the DTU (provided by the Technical University of Denmark), BlendedMVS, and Tanks and Temples datasets. Our code is available at https://github.com/miaowang525/SA-MVSNet.
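
The idea behind the propagation step can be illustrated with a toy sketch. It is not the SA-AP implementation from the paper or its repository: the function names, the scaled dot-product scoring, and the top-k selection rule are all assumptions chosen for illustration. The sketch scores each neighbor in a propagation window against the current pixel's feature, turns the scores into relative weights with a softmax, and keeps the depths of the highest-weighted neighbors as propagation candidates.

```python
import math


def window_attention_weights(center_feat, neighbor_feats):
    """Relative weights of the neighbors with respect to the center pixel.

    Assumption: scaled dot-product scores followed by a softmax over the window.
    """
    scale = math.sqrt(len(center_feat))
    scores = [
        sum(q * k for q, k in zip(center_feat, feat)) / scale
        for feat in neighbor_feats
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_depth_candidates(center_feat, neighbor_feats, neighbor_depths, k=2):
    """Return (depth, weight) pairs for the k highest-weighted neighbors.

    Assumption: top-k selection stands in for the adaptive sampling step.
    """
    weights = window_attention_weights(center_feat, neighbor_feats)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(neighbor_depths[i], weights[i]) for i in ranked[:k]]


# Hypothetical window: two neighbors resemble the center pixel, two do not.
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
depths = [2.0, 2.1, 5.0, 4.0]

candidates = sample_depth_candidates(center, neighbors, depths, k=2)
print([depth for depth, _ in candidates])
```

In this toy window, the neighbors whose features resemble the center pixel receive the largest weights, so their depths are the ones proposed for propagation, which mirrors the intent of favoring neighbors on the same surface.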

MTD-MVSNet: Multi-view Stereo Network with Multi-scale Transformer and Dual Attention

Multi-View Stereo Network Based on Attention Mechanism and Neural Volume Rendering

Attention-enhanced multi-source cost volume multi-view stereo

DSC-MVSNet: attention aware cost volume regularization based on depthwise separable convolution for multi-view stereo

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Multi-View Stereo Network with attention thin volume

MFE‐MVSNet: Multi‐scale feature enhancement multi‐view stereo with bi‐directional connections

HC-MVSNet: A Probability Sampling-Based Multi-View-stereo Network with Hybrid Cascade Structure for 3D Reconstruction

MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo

EI-MVSNet: Epipolar-Guided Multi-View Stereo Network With Interval-Aware Label

NR-MVSNet: Learning Multi-View Stereo Based on Normal Consistency and Depth Refinement

Mono‐MVS: textureless‐aware multi‐view stereo assisted by monocular prediction

Visibility-Aware Point-Based Multi-View Stereo Network

MVSNet: Depth Inference for Unstructured Multi-view Stereo

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

BSI-MVS: multi-view stereo network with bidirectional semantic information

Modeling Long-Range Dependencies and Epipolar Geometry for Multi-View Stereo

SA-MVSNet: Self-attention-based multi-view stereo network for 3D reconstruction of images with weak texture

N2MVSNet: Non-Local Neighbors Aware Multi-View Stereo Network

Transformer-guided Feature Pyramid Network for Multi-View Stereo