Abstract: The precise reconstruction of 3D objects from a single RGB image in complex scenes presents a critical challenge in virtual reality, autonomous driving, and robotics. Existing neural implicit 3D representation methods face significant difficulties in balancing the extraction of global and local features, particularly in diverse and complex environments, leading to insufficient reconstruction precision and quality. We propose M3D, a novel single-view 3D reconstruction framework, to tackle these challenges. This framework adopts a dual-stream feature extraction strategy based on Selective State Spaces to effectively balance the extraction of global and local features, thereby improving scene comprehension and representation precision. Additionally, a parallel branch extracts depth information, effectively integrating visual and geometric features to enhance reconstruction quality and preserve intricate details. Experimental results indicate that the fusion of multi-scale features with depth information via the dual-branch feature extraction significantly boosts geometric consistency and fidelity, achieving state-of-the-art reconstruction performance.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to address the challenge of accurately reconstructing 3D objects from a single RGB image, especially in complex-scene applications. Specifically, the paper proposes improvements for the following problems:

1. **Balance between global and local feature extraction**:
   - Existing neural implicit 3D representation methods have difficulty balancing the extraction of global and local features in complex and diverse environments, resulting in insufficient reconstruction accuracy and quality.
   - Convolutional neural networks (CNNs) are good at extracting local features, but their limited receptive fields make it hard to capture global context, resulting in incomplete or distorted geometric structures in complex scenes.
   - Transformer-based architectures can effectively capture long-distance dependencies, but when reconstructing objects with complex geometric structures, they often fail to model fine local details.
2. **Integration of depth information**:
   - In single-view 3D reconstruction, the lack of depth information is a key issue. Introducing depth information can help resolve ambiguity in occluded areas and improve reconstruction quality by enhancing geometric consistency.
   - Existing methods usually extract RGB and depth features jointly in one stream, which may lead to feature interference and affect spatial consistency.

### Proposed solutions

To solve the above problems, the authors propose M3D, a novel framework for high-fidelity single-view 3D reconstruction. The main innovations of M3D include:

- **Two-stream feature extraction strategy**: Adopt a two-stream feature extraction strategy based on Selective State Spaces (SSM), which effectively balances the extraction of global and local features, thereby improving scene understanding and representation accuracy.
  \[ F_{\text{long}}=\text{SSM}(F_{\text{roi},1}), \quad F_{\text{short}}=\text{CNN}(F_{\text{roi},2}) \]
  where \(F_{\text{roi},1}\) and \(F_{\text{roi},2}\) are channel features segmented from \(F_{\text{roi}}\).
- **Depth-driven branch**: A parallel branch extracts depth information, effectively combining visual and geometric features and enhancing reconstruction quality and detail preservation.
  \[ F_{\text{3D}}=\text{Bilinear}(F_{\text{dep}}\oplus F_{\text{highD}}, P_{2D}) \]
  where \(\oplus\) represents the generalized addition operation between depth and RGB-derived features.
- **Selective attention module**: Introduce a Selective Attention Module, which combines long-distance context and local features to generate a spatially consistent representation, balancing local details and global context.
  \[ F_{\text{context}}=\text{SelfAttention}(\text{MLP}(F_{\text{long}}+F_{\text{short}})) \]
- **Implicit geometric representation and decoding module**: Use a neural implicit representation based on the Signed Distance Function (SDF) to encode 3D geometric shapes and decode them into high-fidelity 3D structures through volume rendering techniques.
  \[ \text{SDF}(x) = s, \quad s\in\mathbb{R} \]
  where \(x\in\mathbb{R}^3\) is a given 3D point, \(s\) is the distance to the nearest surface, and the sign of \(s\) indicates whether the point is inside or outside the surface.

Through these innovations, the M3D framework achieves higher geometric consistency and fidelity in complex scenes, significantly outperforming existing methods. Experimental results show that M3D performs well in Chamfer Distance (CD), F-scor
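To make the SDF definition above concrete, here is a minimal sketch of an analytic signed distance function for a sphere. It is not M3D's learned network: the function names `sphere_sdf` and `classify`, the unit sphere, and the convention that negative values mean "inside" are all assumptions chosen for illustration (the summary only states that the sign distinguishes inside from outside).

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from a 3D point to a sphere's surface.

    Assumed convention: negative inside, zero on the surface,
    positive outside.
    """
    return math.dist(point, center) - radius


def classify(s, eps=1e-9):
    """Map a signed distance s to a human-readable location label."""
    if abs(s) <= eps:
        return "on surface"
    return "inside" if s < 0 else "outside"


if __name__ == "__main__":
    # Hypothetical query points: the centre, a surface point, an exterior point.
    for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]:
        s = sphere_sdf(p)
        print(p, s, classify(s))
```

A learned implicit representation replaces the closed-form expression with a network that predicts \(s\) for any query point \(x\); the surface is then the zero level set of that prediction.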

M3D: Dual-Stream Selective State Spaces and Depth-Driven Framework for High-Fidelity Single-View 3D Reconstruction

Hybrid-MVS: Robust Multi-View Reconstruction with Hybrid Optimization of Visual and Depth Cues

Multi-View Depth Map Sampling for 3D Reconstruction of Natural Scene

DP-MVS: Detail Preserving Multi-View Surface Reconstruction of Large-Scale Scenes

Robust 3D Reconstruction with an RGB-D Camera

Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Multi-view depth estimation based on multi-feature aggregation for 3D reconstruction

Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images

Enhanced multi view 3D reconstruction with improved MVSNet

MVSBoost: An Efficient Point Cloud-based 3D Reconstruction

EPP-MVSNet: Epipolar-assembling based Depth Prediction for Multi-view Stereo

Data-Driven 3D Reconstruction of Dressed Humans From Sparse Views

Attention Aware Cost Volume Pyramid Based Multi-view Stereo Network for 3D Reconstruction

VI3DRM: Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis

Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion

HC-MVSNet: A Probability Sampling-Based Multi-View-stereo Network with Hybrid Cascade Structure for 3D Reconstruction

SimpleRecon: 3D Reconstruction Without 3D Convolutions

FineRecon: Depth-aware Feed-forward Network for Detailed 3D Reconstruction

DUSt3R: Geometric 3D Vision Made Easy