Abstract: Multi-view stereo infers 3D geometry from a set of images captured from several known positions and viewpoints. It is one of the most important components of 3D reconstruction. Recently, deep learning has been increasingly used to solve 3D vision problems, including the multi-view stereo problem, owing to its strong performance. This paper presents a comprehensive review of recent deep learning methods for multi-view stereo. These methods are mainly categorized into depth map based and volumetric based methods according to the 3D representation form, and representative methods are reviewed in detail. Specifically, the plane sweep based methods leveraging depth maps are presented stage by stage, i.e. feature extraction, cost volume construction, cost volume regularization, depth map regression and post-processing. This review also summarizes several widely used datasets and their corresponding evaluation metrics. Finally, several observations and challenges are put forward to inform future research directions.
Engineering, Electrical & Electronic; Instruments & Instrumentation; Optics; Computer Science, Hardware & Architecture
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to provide a comprehensive review of the recent developments in deep learning-based Multi-View Stereo (MVS) methods. Multi-View Stereo technology infers 3D geometric structures from a set of images taken from multiple known positions and viewpoints, making it one of the most important components in 3D reconstruction. In recent years, due to the excellent performance of deep learning in handling 3D vision problems, more and more research has applied it to the multi-view stereo problem.
Specifically, the paper focuses on the following points:
1. **Method Classification**: According to the 3D representation form, deep learning-based multi-view stereo methods are mainly divided into depth map-based methods and voxel-based methods, and representative methods are introduced in detail.
2. **Technical Process**: Walks through the plane-sweep-based methods stage by stage: feature extraction, cost volume construction, cost volume regularization, depth map regression, and post-processing.
3. **Datasets and Evaluation Metrics**: Summarizes several widely used datasets and their corresponding evaluation metrics for evaluating the performance of different methods.
4. **Challenges and Future Directions**: Proposes several valuable observations and challenges, pointing out directions for future research.
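The cost volume construction and depth map regression stages named in point 2 can be illustrated for a single pixel. The sketch below is not taken from the paper: it assumes the per-view features have already been warped onto each depth hypothesis by the plane sweep, and it uses two choices that are common in the literature, a variance-based cost across views and a soft-argmin (expected-value) depth. The function names `variance_cost` and `soft_argmin_depth` are hypothetical, and regularization is omitted entirely.

```python
import math


def variance_cost(view_features):
    """Mean per-channel variance of one pixel's feature vectors across views.

    Assumption: each inner list is the feature of the same 3D hypothesis
    seen from a different view (already warped by the plane sweep).
    """
    n = len(view_features)
    channels = len(view_features[0])
    cost = 0.0
    for c in range(channels):
        vals = [f[c] for f in view_features]
        mean = sum(vals) / n
        cost += sum((v - mean) ** 2 for v in vals) / n
    return cost / channels


def soft_argmin_depth(costs, depths):
    """Expected depth for one pixel from per-hypothesis matching costs."""
    if len(costs) != len(depths) or not costs:
        raise ValueError("costs and depths must be non-empty and equal length")
    # Lower cost means a better match, so the softmax uses negated costs.
    m = min(costs)  # shift for numerical stability
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, depths))


# Hypothetical example: three depth hypotheses, two views, two channels.
depths = [1.0, 2.0, 3.0]
warped = [
    [[0.0, 0.0], [4.0, 4.0]],  # views disagree at depth 1.0
    [[1.0, 2.0], [1.0, 2.0]],  # views agree at depth 2.0
    [[4.0, 4.0], [0.0, 0.0]],  # views disagree at depth 3.0
]
costs = [variance_cost(views) for views in warped]
estimated_depth = soft_argmin_depth(costs, depths)
```

Because the two outer hypotheses have equal cost and the middle one has zero cost, the expected depth lands on the middle hypothesis. In a learned pipeline the costs would come from a regularized cost volume rather than raw variances.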
### Background
3D reconstruction of various environments is a core task in 3D computer vision. It matters for artificial intelligence and is widely applied in autonomous driving, virtual and augmented reality, and robotics. As 3D acquisition devices such as LiDAR and other depth sensors have become reliable, lightweight, and inexpensive, they have been widely deployed in autonomous vehicles, ground robots, and even smartphones. However, the depth maps these sensors capture are sparse, lack fine detail, or are limited to a certain depth range, which restricts their use in outdoor scenes. The demand for dense, detailed 3D reconstruction has therefore driven the development of methods that reconstruct 3D geometry from a series of images, which carry more texture and lighting information and so help recover fine models.
### Methods
The basic pipeline for 3D reconstruction from images, in which multi-view stereo forms the final dense stage, includes:
1. **Feature Extraction and Matching**: Extracting and matching features between multiple images to search for correspondences.
2. **Image Registration and Triangulation**: Estimating the camera extrinsic parameters and performing sparse reconstruction (also known as structure from motion).
3. **Dense 3D Reconstruction**: Using the known intrinsic parameters and the estimated extrinsic parameters to perform dense 3D reconstruction from the images.
For the dense stage, depth map-based methods predict a 2.5D depth map for each view and then use 3D fusion techniques to merge the depth maps into a coherent 3D model. Voxel-based methods instead predict occupancy in a 3D voxel grid directly from the input image set, yielding a globally coherent scene representation.
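As a rough illustration of how a 2.5D depth map becomes 3D geometry, the sketch below back-projects each valid pixel through a pinhole camera model into camera-frame 3D points. This is only the first step of fusion and is not the paper's procedure: a real fusion stage would also transform points into a common world frame using the extrinsic parameters and filter them for cross-view consistency. The function name `backproject_depth` and the convention that non-positive depths mark invalid pixels are assumptions made for this example.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3D points.

    depth: list of rows, depth[v][u] is the depth at pixel (u, v).
    fx, fy, cx, cy: pinhole intrinsics (focal lengths, principal point).
    Assumption: depths <= 0 mark invalid pixels and are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with two valid pixels.
points = backproject_depth(
    [[2.0, 0.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0
)
```

A voxel-based method would skip this per-view step and predict occupancy on the grid directly.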
### Contributions of Deep Learning
In recent years, deep neural networks have achieved strong predictive performance across computer vision applications, including person re-identification, image alignment, face alignment, object recognition, and stereo matching. Their capacity for feature extraction and aggregation has also sparked interest in applying them to multi-view stereo. Rich image representations help resolve the matching ambiguity caused by occlusion, varying lighting conditions, or textureless regions. Thanks to large-scale 3D scene reconstruction datasets, deep learning-based multi-view stereo methods have significantly improved on the performance of traditional methods.
### Main Contributions
1. **First Review**: This is the first review covering the latest developments in deep learning-based multi-view stereo methods, including depth map-based and voxel-based methods.
2. **Detailed Analysis**: Provides a detailed analysis of depth map-based methods, showcasing the main focus of recent work.
3. **Performance Summary**: Summarizes the performance of most methods on mainstream datasets.
### Conclusion
The paper provides a comprehensive review of deep learning-based multi-view stereo methods, with a focus on depth map-based methods, and summarizes the performance on mainstream datasets. Finally, it proposes future research directions, providing guidance for further development in this field.