Abstract: Learning-based multi-view stereo (MVS) has thus far centered on 3D convolution over cost volumes. Because 3D CNNs are expensive in both computation and memory, the resolution of the output depth is often considerably limited. Unlike most existing works, which are dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization, which is much more lightweight than full cost volume optimization. In particular, we propose RayMVSNet, which learns sequential prediction of a 1D implicit field along each camera ray, with the zero-crossing point indicating scene depth. This sequential modeling, conducted on transformer features, essentially learns the epipolar line search of traditional multi-view stereo. We devise a multi-task learning scheme for better optimization convergence and depth accuracy. We find that the monotonicity of the SDFs along each ray greatly benefits depth estimation. Our method outperforms all previous learning-based methods on both the DTU and the Tanks & Temples datasets, achieving an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples. It produces high-quality depth estimation and point cloud reconstruction in challenging scenarios such as objects and scenes with non-textured surfaces, severe occlusion, and highly varying depth range. Further, we propose RayMVSNet++, which enhances contextual feature aggregation for each ray through an attentional gating unit that selects semantically relevant neighboring rays within the local frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces accurate results on the two subsets of textureless regions and large depth variation.
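The idea that the zero-crossing of a 1D implicit field along a ray indicates depth can be illustrated with a small sketch. It is not the authors' network: it assumes the signed distance values have already been predicted at a few sampled depths, that they are positive in front of the surface and negative behind it (a sign convention chosen here for illustration), and that linear interpolation between the two bracketing samples is accurate enough. The function name `zero_crossing_depth` and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def zero_crossing_depth(
    depths: Sequence[float], sdf: Sequence[float]
) -> Optional[float]:
    """Return the depth at which sampled SDF values first cross zero.

    Assumption (not from the paper): sdf[i] is the signed distance at
    depths[i], positive in front of the surface and negative behind it.
    """
    if len(depths) != len(sdf):
        raise ValueError("depths and sdf must have the same length")

    for i in range(len(sdf) - 1):
        d0, d1 = depths[i], depths[i + 1]
        s0, s1 = sdf[i], sdf[i + 1]
        if s0 == 0.0:
            # A sample lies exactly on the surface.
            return d0
        if s0 > 0.0 and s1 < 0.0:
            # Linearly interpolate between the two bracketing samples.
            t = s0 / (s0 - s1)
            return d0 + t * (d1 - d0)

    # Handle a zero exactly at the last sample.
    if len(sdf) > 0 and sdf[-1] == 0.0:
        return depths[-1]

    # No sign change: the ray does not hit a surface in the sampled range.
    return None


if __name__ == "__main__":
    sample_depths = [1.0, 2.0, 3.0, 4.0]
    sample_sdf = [1.5, 0.5, -0.5, -1.5]  # monotonically decreasing
    print(zero_crossing_depth(sample_depths, sample_sdf))
```

When the values decrease monotonically along the ray, as in the sample above, there is at most one sign change, so the first crossing found is the only one. This is one way to see why monotonicity is helpful for a depth readout of this kind.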

Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference

Bidirectional Hybrid LSTM Based Recurrent Neural Network for Multi-View Stereo

Multi-View Stereo Representation Revist: Region-Aware MVSNet

DSC-MVSNet: attention aware cost volume regularization based on depthwise separable convolution for multi-view stereo

MVSNet: Depth Inference for Unstructured Multi-view Stereo

Unsupervised multi-view stereo network based on multi-stage depth estimation

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

Attention Aware Cost Volume Pyramid Based Multi-view Stereo Network for 3D Reconstruction

AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network

NR-MVSNet: Learning Multi-View Stereo Based on Normal Consistency and Depth Refinement

HC-MVSNet: A Probability Sampling-Based Multi-View-stereo Network with Hybrid Cascade Structure for 3D Reconstruction

LNMVSNet: A Low-Noise Multi-View Stereo Depth Inference Method for 3D Reconstruction

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Multi-View Stereo Network with attention thin volume

Multi-View Stereo Network Based on Attention Mechanism and Neural Volume Rendering

EPP-MVSNet: Epipolar-assembling based Depth Prediction for Multi-view Stereo

EI-MVSNet: Epipolar-Guided Multi-View Stereo Network With Interval-Aware Label

N2MVSNet: Non-Local Neighbors Aware Multi-View Stereo Network

MTD-MVSNet: Multi-view Stereo Network with Multi-scale Transformer and Dual Attention

Attention-enhanced multi-source cost volume multi-view stereo

Cost Volume Pyramid Based Depth Inference for Multi-View Stereo