Abstract: The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. This work aims to reconstruct high-fidelity surfaces with fine-grained details by addressing the above limitations. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method outperforms existing methods in terms of reconstruction quality. Furthermore, the proposed method also generalizes well to real-world indoor scenarios captured by our hand-held mobile phones. Our code is publicly available at: https://github.com/yec22/Fine-Grained-Indoor-Recon.
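The abstract does not say which filters the image sharpening and denoising step uses. As a minimal sketch, assuming a Gaussian blur for denoising followed by unsharp masking for sharpening (a common stand-in, not necessarily the authors' choice), the pre-processing applied to an RGB image before normal estimation could look like this. The functions `gaussian_blur` and `sharpen_and_denoise` and all parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D normalized Gaussian kernel with radius ceil(3 * sigma)."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an H x W x C image with reflect padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = img.astype(np.float64)
    for axis in (0, 1):  # blur rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="reflect")
        # 'valid' convolution of each padded 1-D line restores the original length
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

def sharpen_and_denoise(img, denoise_sigma=0.8, sharpen_sigma=2.0, amount=1.0):
    """Hypothetical pre-processing: light Gaussian denoising, then unsharp masking.
    Input and output are float images with values in [0, 1]."""
    denoised = gaussian_blur(img, denoise_sigma)
    detail = denoised - gaussian_blur(denoised, sharpen_sigma)  # high-frequency residual
    return np.clip(denoised + amount * detail, 0.0, 1.0)
```

Denoising first and sharpening second keeps the unsharp mask from amplifying pixel noise. On a step edge the unsharp mask produces a small overshoot on both sides, which increases local contrast at object boundaries.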

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the challenges of reconstructing indoor scenes from multi-view RGB images, in particular how to capture high-fidelity, fine-grained details. Existing methods reconstruct large, smooth regions such as floors and walls well, but they perform poorly on complex surfaces with high-frequency structures (such as small objects on a table or finely detailed furniture). The reason is twofold: existing neural implicit representations lack the expressive capacity for such detail, and the predicted normal priors used for supervision are often inaccurate. To overcome these limitations, the paper proposes a hybrid representation that models low-frequency, smooth regions and high-frequency, fine-grained regions separately. It also introduces an image sharpening and denoising technique to improve the quality of the predicted normal priors, and an uncertainty module that estimates how reliable each predicted normal is. Together, these improvements aim to increase the fidelity and accuracy of indoor scene reconstruction.

### Main contributions

1. **Hybrid implicit SDF architecture**: combines an MLP with a tri-plane representation so that both the low-frequency, smooth regions and the high-frequency, fine-grained regions of indoor scenes are represented well at the same time.
2. **Normal prior enhancement**: improves the quality of the predicted normal priors through image sharpening and denoising, and adds an uncertainty module that estimates the reliability of each normal prior.
3. **Experimental validation**: qualitative and quantitative experiments show that the method outperforms existing methods in reconstruction quality and generalizes well to real-world indoor scenes.
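The summary says only that the hybrid SDF "combines MLP and tri-plane representations". A minimal NumPy sketch, assuming the two branches are simply added (an MLP for the smooth base shape plus a tri-plane feature residual for detail), is shown below. The additive combination, the zero-initialized planes, the layer sizes, and the names `HybridSDF` and `bilinear_sample` are all hypothetical, and only the forward pass is shown (no training).

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_sample(plane, uv):
    """Sample a (C, R, R) feature plane at points uv in [-1, 1]^2; returns (N, C)."""
    C, R, _ = plane.shape
    coords = (np.clip(uv, -1.0, 1.0) + 1.0) * 0.5 * (R - 1)  # map to pixel coords [0, R-1]
    x, y = coords[:, 0], coords[:, 1]
    x0 = np.clip(np.floor(x).astype(int), 0, R - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, R - 2)
    wx, wy = x - x0, y - y0
    # plane is indexed as plane[channel, row = y, col = x]
    f00, f01 = plane[:, y0, x0], plane[:, y0, x0 + 1]
    f10, f11 = plane[:, y0 + 1, x0], plane[:, y0 + 1, x0 + 1]
    out = (f00 * (1 - wx) * (1 - wy) + f01 * wx * (1 - wy)
           + f10 * (1 - wx) * wy + f11 * wx * wy)
    return out.T

class HybridSDF:
    """Toy hybrid SDF: MLP (low-frequency) + tri-plane residual (high-frequency)."""

    def __init__(self, hidden=64, channels=8, res=32):
        self.W1 = rng.normal(0, 1 / np.sqrt(3), (3, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, 1))
        self.b2 = np.zeros(1)
        # three axis-aligned feature planes (XY, XZ, YZ); zero init = no detail yet
        self.planes = np.zeros((3, channels, res, res))
        self.W_dec = rng.normal(0, 1 / np.sqrt(channels), (channels, 1))

    def mlp(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # ReLU for brevity
        return (h @ self.W2 + self.b2)[:, 0]

    def triplane(self, x):
        pairs = [(0, 1), (0, 2), (1, 2)]  # project each point onto XY, XZ, YZ
        feat = sum(bilinear_sample(p, x[:, list(ij)]) for p, ij in zip(self.planes, pairs))
        return (feat @ self.W_dec)[:, 0]  # linear decoder to a scalar SDF residual

    def __call__(self, x):
        """Signed distance at points x of shape (N, 3) inside [-1, 1]^3."""
        return self.mlp(x) + self.triplane(x)

sdf = HybridSDF()
print(sdf(rng.uniform(-1, 1, (5, 3))).shape)  # (5,)
```

The design intuition this sketch illustrates is that an MLP is biased toward smooth functions, while a grid-like tri-plane can store sharp local variation in its texels. With zero-initialized planes the model starts as a pure MLP and the tri-plane only has to learn the residual detail.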
### Formula summary

- **Volume rendering equation**:
  \[ C(r)=\sum_{i=1}^{N}T_{i}\,\alpha_{i}\,f_{c}(r(t_{i}),d),\qquad T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{j}) \]
  \[ \alpha_{i}=\max\left(0,\ \frac{\Phi_{\tau}(f_{g}(r(t_{i})))-\Phi_{\tau}(f_{g}(r(t_{i+1})))}{\Phi_{\tau}(f_{g}(r(t_{i})))}\right) \]
  where $f_g$ is the geometry (SDF) network, $f_c$ is the color network, $d$ is the viewing direction, and $\Phi_{\tau}$ is the sigmoid function with learnable parameter $\tau$.
- **Eikonal loss**:
  \[ L_{\text{eik}}=\frac{1}{N}\sum_{i=1}^{N}\left(\|\nabla s_{i}\|_{2}-1\right)^{2} \]
  where $\nabla s_i$ is the SDF gradient at the $i$-th sampled point.
- **RGB color loss**:
  \[ L_{\text{rgb}}=\frac{1}{|R|}\sum_{r\in R}\|C(r)-\hat{C}(r)\|_{1} \]
  where $R$ is the set of sampled rays and $\hat{C}(r)$ is the observed pixel color.
- **Normal prior loss**:
  \[ L_{\text{prior}}=\frac{1}{|R|}\sum_{r\in R}\left(1-u(r)\right)\left|1-n(r)^{\top}\hat{n}(r)\right| \]
  where $n(r)$ is the rendered surface normal, $\hat{n}(r)$ is the predicted normal prior, and $u(r)$ is the estimated uncertainty of that prior, so rays with unreliable priors contribute less to the loss.

Together, these components improve the accuracy and level of detail of indoor scene reconstruction.
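The formulas above can be checked numerically. The following is a minimal NumPy sketch of the rendering weights and the three losses; it assumes SDF values, colors, gradients, normals, and uncertainties are already given (no networks, no autodiff), and the sample count, the value of $\tau$, and helper names such as `render_ray` are illustrative choices rather than values from the paper.

```python
import numpy as np

def sigmoid(x, tau):
    """Phi_tau(x) = 1 / (1 + exp(-tau * x))."""
    return 1.0 / (1.0 + np.exp(-tau * x))

def render_ray(sdf, colors, tau):
    """Render one ray from N+1 SDF samples sdf[i] = f_g(r(t_i)) and N colors (N, 3)."""
    phi = sigmoid(sdf, tau)
    alpha = np.maximum(0.0, (phi[:-1] - phi[1:]) / phi[:-1])     # alpha_i, i = 1..N
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])   # T_i = prod_{j<i} (1 - alpha_j)
    weights = T * alpha
    return weights @ colors, weights                             # C(r) and per-sample weights

def eikonal_loss(grad):
    """Mean of (||grad||_2 - 1)^2 over (N, 3) SDF gradients."""
    return np.mean((np.linalg.norm(grad, axis=-1) - 1.0) ** 2)

def rgb_loss(pred, gt):
    """L1 color error per ray, averaged over rays; pred and gt are (R, 3)."""
    return np.mean(np.sum(np.abs(pred - gt), axis=-1))

def normal_prior_loss(n, n_hat, u):
    """Uncertainty-weighted angular normal loss; u in [0, 1] per ray."""
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    n_hat = n_hat / np.linalg.norm(n_hat, axis=-1, keepdims=True)
    cos = np.sum(n * n_hat, axis=-1)
    return np.mean((1.0 - u) * np.abs(1.0 - cos))

# a ray whose SDF falls linearly from 0.5 to -0.5, crossing zero at sample 32
c, w = render_ray(np.linspace(0.5, -0.5, 65), np.tile([1.0, 0.0, 0.0], (64, 1)), tau=50.0)
print(round(w.sum(), 6))  # 1.0
```

Writing $\phi_i=\Phi_\tau(f_g(r(t_i)))$, the product $T_i\alpha_i$ telescopes to $(\phi_i-\phi_{i+1})/\phi_1$ when the SDF decreases along the ray. The weights therefore sum to $1-\phi_{N+1}/\phi_1\approx 1$ and peak where the sigmoid is steepest, i.e. where the SDF crosses zero, which places the rendered surface on the SDF zero level set.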

Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement

NeuralRoom: Geometry-Constrained Neural Implicit Surfaces for Indoor Scene Reconstruction

Fine-detailed Neural Indoor Scene Reconstruction using multi-level importance sampling and multi-view consistency

Three-Dimensional Reconstruction of Indoor Scenes Based on Implicit Neural Representation

Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints

Scalable Neural Indoor Scene Rendering

Neural 3D Scene Reconstruction with Indoor Planar Priors

Real-time indoor scene reconstruction with Manhattan assumption

Reconstruction of Indoor Scene from A Single Image

I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing Via Raytracing in Neural SDFs

DebSDF: Delving into the Details and Bias of Neural Indoor Scene Reconstruction

A New Era of Indoor Scene Reconstruction: A Survey

Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

NopeRoom: Geometric Prior Based Indoor Scene Reconstruction with Unknown Poses

Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views

ND-SDF: Learning Normal Deflection Fields for High-Fidelity Indoor Reconstruction

Indoor Scene Reconstruction From Monocular Video Combining Contextual and Geometric Priors

NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation

Fast and Automatic Reconstruction of Semantically Rich 3D Indoor Maps from Low-quality RGB-D Sequences

P²SDF for Neural Indoor Scene Reconstruction