M3D: Dual-Stream Selective State Spaces and Depth-Driven Framework for High-Fidelity Single-View 3D Reconstruction

Luoxi Zhang, Pragyan Shrestha, Yu Zhou, Chun Xie, Itaru Kitahara
2024-11-20
Abstract: The precise reconstruction of 3D objects from a single RGB image in complex scenes presents a critical challenge in virtual reality, autonomous driving, and robotics. Existing neural implicit 3D representation methods face significant difficulties in balancing the extraction of global and local features, particularly in diverse and complex environments, leading to insufficient reconstruction precision and quality. We propose M3D, a novel single-view 3D reconstruction framework, to tackle these challenges. This framework adopts a dual-stream feature extraction strategy based on Selective State Spaces to effectively balance the extraction of global and local features, thereby improving scene comprehension and representation precision. Additionally, a parallel branch extracts depth information, effectively integrating visual and geometric features to enhance reconstruction quality and preserve intricate details. Experimental results indicate that the fusion of multi-scale features with depth information via the dual-branch feature extraction significantly boosts geometric consistency and fidelity, achieving state-of-the-art reconstruction performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper addresses the challenge of accurately reconstructing 3D objects from a single RGB image, especially in complex scenes. Specifically, it proposes improvements for the following problems:

1. **Balance between global and local feature extraction**:
   - Existing neural implicit 3D representation methods struggle to balance the extraction of global and local features in complex and diverse environments, which limits reconstruction accuracy and quality.
   - Convolutional neural networks (CNNs) are good at extracting local features, but their limited receptive fields make it hard to capture global context, resulting in incomplete or distorted geometric structures in complex scenes.
   - Transformer-based architectures capture long-range dependencies effectively, but they often fail to model fine local details when reconstructing objects with complex geometric structures.
2. **Integration of depth information**:
   - In single-view 3D reconstruction, the lack of depth information is a key issue. Introducing depth information helps resolve ambiguity in occluded areas and improves reconstruction quality by enhancing geometric consistency.
   - Existing methods usually extract RGB and depth features jointly in one stream, which may lead to feature interference and affect spatial consistency.

### Proposed solutions

To solve the above problems, the authors propose M3D, a novel framework for high-fidelity single-view 3D reconstruction. Its main innovations include:

- **Dual-stream feature extraction strategy**: A dual-stream feature extraction strategy based on Selective State Spaces (SSM) balances the extraction of global and local features, thereby improving scene understanding and representation accuracy.
  \[ F_{\text{long}}=\text{SSM}(F_{\text{roi},1}), \quad F_{\text{short}}=\text{CNN}(F_{\text{roi},2}) \]
  where \(F_{\text{roi},1}\) and \(F_{\text{roi},2}\) are channel features segmented from \(F_{\text{roi}}\).
- **Depth-driven branch**: A parallel branch extracts depth information, effectively combining visual and geometric features and enhancing the reconstruction quality and detail preservation.
  \[ F_{\text{3D}}=\text{Bilinear}(F_{\text{dep}}\oplus F_{\text{highD}}, P_{2D}) \]
  where \(\oplus\) represents the generalized addition operation between depth and RGB-derived features.
- **Selective attention module**: Introduce a Selective Attention Module, which combines long-distance context and local features to generate a spatially consistent representation, balancing local details and global context.
  \[ F_{\text{context}}=\text{SelfAttention}(\text{MLP}(F_{\text{long}}+F_{\text{short}})) \]
- **Implicit geometric representation and decoding module**: Use a neural implicit representation based on the Signed Distance Function (SDF) to encode 3D geometric shapes and decode them into high-fidelity 3D structures through volume rendering techniques.
  \[ \text{SDF}(x) = s, \quad s\in\mathbb{R} \]
  where \(x\in\mathbb{R}^3\) is a given 3D point, \(s\) is the distance to the nearest surface, and the sign indicates whether the point is inside or outside the surface.

Through these innovations, the M3D framework achieves higher geometric consistency and fidelity in complex scenes, significantly outperforming existing methods. Experimental results show that M3D performs well in Chamfer Distance (CD), F-scor
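
As a minimal sketch of the SDF definition above, the following example evaluates a signed distance for an analytic sphere rather than for M3D's learned network. The function names `sphere_sdf` and `classify`, the unit sphere, and the negative-inside sign convention are assumptions made for illustration; the summary only states that the sign separates inside from outside.

```python
import math
from typing import Sequence


def sphere_sdf(
    x: Sequence[float],
    center: Sequence[float] = (0.0, 0.0, 0.0),
    radius: float = 1.0,
) -> float:
    """Signed distance s from a 3D point x to the surface of a sphere.

    Assumed convention (not stated in the source): s < 0 inside,
    s == 0 on the surface, s > 0 outside.
    """
    dist_to_center = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    return dist_to_center - radius


def classify(s: float, eps: float = 1e-9) -> str:
    """Map a signed distance to a region label under the assumed convention."""
    if abs(s) <= eps:
        return "surface"
    return "inside" if s < 0 else "outside"


if __name__ == "__main__":
    # Three query points along the x-axis of a unit sphere at the origin.
    for point in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
        s = sphere_sdf(point)
        print(point, s, classify(s))
```

In a learned implicit representation, a network conditioned on image features would take the place of `sphere_sdf`, and the reconstructed surface is the set of points where the predicted value is zero.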