Abstract: In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and pixelwise search interval on depths, significantly improving the completeness of the reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method can achieve state-of-the-art results with less time consumption.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM optimization" aims to address the challenges that Multi-View Stereo (MVS) faces when reconstructing textureless regions in 3D. Specifically, the paper identifies the following main issues:

1. **Inaccurate Depth Estimation in Textureless Regions**:
   - Traditional MVS methods struggle with depth estimation in textureless regions due to the lack of texture information.
   - Existing methods have attempted to improve this through techniques like planar priors and superpixel segmentation, but their performance in large textureless areas remains unsatisfactory.
2. **Parameter Dependence on Empirical Tuning**:
   - In current MVS methods, parameters often need to be manually adjusted, which is time-consuming and can lead to suboptimal results.
3. **High Memory and Time Consumption**:
   - Learning-based MVS methods can improve reconstruction quality but often come with high time and memory costs, limiting their practical application.
4. **Insufficient Utilization of Edge Information**:
   - Edge information is crucial in image processing, but existing methods make inadequate use of it, especially in complex scenes where shadows and occlusions weaken the association between edges and depth boundaries.

### Solutions

To address the above issues, the paper proposes a new method, **Segmentation-Driven Deformation Multi-View Stereo (SD-MVS)**, with the following main contributions:

1. **Instance Segmentation-Based Adaptive Patch Deformation**:
   - Utilizes the Segment Anything Model (SAM) for instance segmentation to extract fine edge information while ignoring strong lighting interference.
   - Through adaptively deformed patches, it makes better use of image edge information, improving the accuracy of matching costs and propagation.
2. **Spherical Gradient Refinement**:
   - Introduces a spherical coordinate system and a gradient descent method to optimize the search accuracy of normals and depth.
   - By randomly selecting two orthogonal unit vectors for perturbation and further optimizing the perturbation direction with gradient descent, it improves the accuracy of each hypothesis.
3. **EM Algorithm-Based Hyperparameter Optimization**:
   - Employs the Expectation-Maximization (EM) algorithm to alternately optimize aggregated matching costs and hyperparameters, achieving automatic parameter tuning and balancing different information considerations.
4. **Multi-Scale Consistency Architecture**:
   - Introduces a multi-scale consistency architecture to reduce memory consumption and improve operational efficiency.
   - By loading images of different scales in parallel, it replaces the traditional cascade architecture, reducing data transfer time between the CPU and GPU.

### Experimental Results

The paper evaluates the SD-MVS method on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset. The results show that the SD-MVS method achieves state-of-the-art performance while reducing time consumption.

### Conclusion

By introducing techniques such as instance segmentation, spherical gradient refinement, and EM algorithm optimization, the paper effectively addresses issues like inaccurate depth estimation in textureless regions, parameter dependence on empirical tuning, and high memory and time consumption in MVS. This provides new insights for the development of multi-view stereo technology.
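To make the normal-refinement idea in the summary more concrete, the following is a minimal sketch of perturbing a unit normal along two orthogonal tangent directions and then descending a matching cost. It is not the authors' implementation: the function names (`tangent_basis`, `refine_normal`), the finite-difference gradient, the step sizes, and the toy cost used in the usage note are all assumptions made for illustration, and the paper's actual spherical-coordinate parameterization and depth search interval are not reproduced here.

```python
import numpy as np


def tangent_basis(n, rng):
    """Return two unit vectors orthogonal to n and to each other.

    The pair is rotated by a random angle in the tangent plane, which
    stands in for the 'randomly selected orthogonal unit vectors'
    mentioned in the summary (an assumption about the details).
    """
    n = n / np.linalg.norm(n)
    # Use the coordinate axis least aligned with n so the cross product
    # below is never close to zero.
    helper = np.eye(3)[np.argmin(np.abs(n))]
    u = np.cross(n, helper)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    u_rot = np.cos(phi) * u + np.sin(phi) * v
    v_rot = np.cross(n, u_rot)
    return u_rot, v_rot


def _unit(x):
    return x / np.linalg.norm(x)


def refine_normal(n, cost_fn, lr=0.5, eps=1e-3, iters=50, seed=0):
    """Hypothetical refinement loop: gradient descent on the unit sphere.

    cost_fn maps a unit normal to a scalar cost (lower is better). The
    gradient is estimated by finite differences along the two tangent
    directions, and each update is re-normalized onto the sphere.
    """
    rng = np.random.default_rng(seed)
    n = _unit(np.asarray(n, dtype=float))
    for _ in range(iters):
        u, v = tangent_basis(n, rng)
        c0 = cost_fn(n)
        g_u = (cost_fn(_unit(n + eps * u)) - c0) / eps
        g_v = (cost_fn(_unit(n + eps * v)) - c0) / eps
        n = _unit(n - lr * (g_u * u + g_v * v))
    return n
```

As a usage example, with a toy cost such as `lambda n: 1.0 - float(n @ target)` for some unit vector `target`, calling `refine_normal(np.array([0.0, 0.0, 1.0]), cost)` moves the normal toward `target`. In a real MVS pipeline the cost would instead be a photometric matching cost evaluated on the patch induced by the hypothesized plane.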

SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM optimization

Hybrid-MVS: Robust Multi-View Reconstruction with Hybrid Optimization of Visual and Depth Cues

SDL-MVS: View Space and Depth Deformable Learning Paradigm for Multi-View Stereo Reconstruction in Remote Sensing

TSAR-MVS: Textureless-aware Segmentation and Correlative Refinement Guided Multi-View Stereo

MSP-MVS: Multi-granularity Segmentation Prior Guided Multi-View Stereo

DP-MVS: Detail Preserving Multi-View Surface Reconstruction of Large-Scale Scenes

High-Quality Depth Recovery Via Interactive Multi-view Stereo

MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

PM-PM: PatchMatch with Potts Model for Object Segmentation and Stereo Matching

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Mono‐MVS: textureless‐aware multi‐view stereo assisted by monocular prediction

A Multitask Network for Multiview Stereo Reconstruction: When Semantic Consistency-Based Clustering Meets Depth Estimation Optimization

A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view Stereo

High completeness multi-view stereo for dense reconstruction of large-scale urban scenes

NR-MVSNet: Learning Multi-View Stereo Based on Normal Consistency and Depth Refinement

EPP-MVSNet: Epipolar-assembling based Depth Prediction for Multi-view Stereo

Rethinking the Multi-view Stereo from the Perspective of Rendering-based Augmentation

ElasticMVS: Learning elastic part representation for self-supervised multi-view stereopsis

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Enhanced multi view 3D reconstruction with improved MVSNet