Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement

Sheng Ye, Yubin Hu, Matthieu Lin, Yu-Hui Wen, Wang Zhao, Yong-Jin Liu, Wenping Wang
2024-08-13
Abstract: The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. This work aims to reconstruct high-fidelity surfaces with fine-grained details by addressing the above limitations. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method outperforms existing methods in terms of reconstruction quality. Furthermore, the proposed method also generalizes well to real-world indoor scenarios captured by our hand-held mobile phones. Our code is publicly available at: https://github.com/yec22/Fine-Grained-Indoor-Recon.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the reconstruction of indoor scenes from multi-view RGB images, and in particular the recovery of high-fidelity, fine-grained details. Existing methods reconstruct large, smooth regions such as floors and walls well, but they perform poorly on complex surfaces with high-frequency structures (such as small objects on a table or intricate furniture). Two limitations are responsible: the limited expressiveness of existing neural implicit representations, and the inaccuracy of the predicted normal priors.

To overcome these limitations, the paper proposes a hybrid representation that models low-frequency, smooth regions and high-frequency, fine-grained regions separately. It also introduces an image sharpening and denoising technique to improve the quality of the predicted normal priors, and designs an uncertainty module to evaluate how reliable those priors are. Together, these changes aim to improve the fidelity and accuracy of indoor scene reconstruction.

### Main contributions

1. **Hybrid implicit SDF architecture**: It combines MLP and tri-plane representations, so that the low-frequency, smooth regions and the high-frequency, fine-grained regions of an indoor scene can both be represented well.
2. **Normal prior enhancement technique**: It improves the quality of the predicted normal priors through image sharpening and denoising, and adds an uncertainty module that evaluates the reliability of the normal priors.
3. **Experimental verification**: Qualitative and quantitative experiments show that the method outperforms existing methods in reconstruction quality and generalizes well to real-world indoor scenes.
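
To make the tri-plane half of the hybrid representation more concrete, the following is a minimal sketch of a generic tri-plane feature lookup. It is not taken from the paper's code: the function names `bilinear` and `triplane_feature`, the scalar features, the unit-cube coordinates, and the choice to sum the three plane samples are all assumptions made for illustration. How the paper fuses these features with the MLP branch is not described in this summary.

```python
def bilinear(plane, u, v):
    """Bilinearly interpolate a 2D grid plane[row][col] at (u, v) in [0, 1].

    u indexes columns and v indexes rows; out-of-range inputs are clamped.
    """
    rows, cols = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
    bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def triplane_feature(planes, point):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the samples.

    Summation is an illustrative aggregation choice (an assumption);
    concatenation is another common option.
    """
    x, y, z = point
    return (bilinear(planes["xy"], x, y)
            + bilinear(planes["xz"], x, z)
            + bilinear(planes["yz"], y, z))


if __name__ == "__main__":
    grid = [[0.0, 1.0], [2.0, 3.0]]  # hypothetical 2x2 feature grid
    planes = {"xy": grid, "xz": grid, "yz": grid}
    print(triplane_feature(planes, (0.5, 0.5, 0.5)))
```

In a hybrid design of this kind, a feature grid can store sharp local variation at its resolution, while a plain MLP tends to favor smooth functions, which matches the low-frequency versus high-frequency split described above.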
### Formula summary

- **Volume rendering equation**:
\[ C(r)=\sum_{i=1}^{N}T_{i}\alpha_{i}f_{c}(r(t_{i}),d),\quad T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{j}) \]
\[ \alpha_{i}=\max\left(0,\frac{\Phi_{\tau}(f_{g}(r(t_{i})))-\Phi_{\tau}(f_{g}(r(t_{i+1})))}{\Phi_{\tau}(f_{g}(r(t_{i})))}\right) \]
where $\Phi_{\tau}$ is the Sigmoid function with learnable parameter $\tau$.
- **Eikonal loss**:
\[ L_{\text{eik}}=\frac{1}{N}\sum_{i=1}^{N}\left(\|\nabla s_{i}\|_{2}-1\right)^{2} \]
- **RGB color loss**:
\[ L_{\text{rgb}}=\frac{1}{|R|}\sum_{r\in R}\|C(r)-\hat{C}(r)\|_{1} \]
- **Normal prior loss**:
\[ L_{\text{prior}}=\frac{1}{|R|}\sum_{r\in R}(1-u(r))\left|1-n(r)^{\top}\hat{n}(r)\right| \]
where $u(r)$ is the estimated uncertainty of the normal prior for ray $r$, so that unreliable priors contribute less to the loss.

Together, these components improve the precision and fine-grained detail of indoor scene reconstruction.
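
The formulas above can be traced by hand on a single ray. The sketch below is illustrative only and is not the authors' implementation: it assumes $\Phi_{\tau}(x)=1/(1+e^{-\tau x})$ with a fixed `tau` (learnable in the paper), takes SDF values, colors and normals as plain Python lists, and uses hypothetical function names throughout.

```python
import math


def sigmoid(x, tau):
    """Phi_tau, assumed here to be the logistic function 1 / (1 + exp(-tau * x))."""
    return 1.0 / (1.0 + math.exp(-tau * x))


def alphas_from_sdf(sdf, tau):
    """Opacity alpha_i from each pair of consecutive SDF samples on a ray."""
    out = []
    for s_i, s_next in zip(sdf[:-1], sdf[1:]):
        phi_i, phi_next = sigmoid(s_i, tau), sigmoid(s_next, tau)
        out.append(max(0.0, (phi_i - phi_next) / phi_i))
    return out


def render_color(alphas, colors):
    """C(r) = sum_i T_i * alpha_i * c_i with T_i = prod_{j<i} (1 - alpha_j)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    weights = []
    for a, c in zip(alphas, colors):
        w = transmittance * a
        weights.append(w)
        pixel = [p + w * ch for p, ch in zip(pixel, c)]
        transmittance *= 1.0 - a
    return pixel, weights


def normal_prior_loss(rendered, prior, uncertainty):
    """Mean over rays of (1 - u) * |1 - n^T n_hat|."""
    total = 0.0
    for n, n_hat, u in zip(rendered, prior, uncertainty):
        cos = sum(a * b for a, b in zip(n, n_hat))
        total += (1.0 - u) * abs(1.0 - cos)
    return total / len(rendered)


if __name__ == "__main__":
    sdf = [0.5, 0.3, 0.1, -0.1, -0.3]  # ray crossing a surface (sign change)
    alphas = alphas_from_sdf(sdf, tau=10.0)
    colors = [[1.0, 0.0, 0.0]] * len(alphas)
    pixel, weights = render_color(alphas, colors)
    print(pixel, sum(weights))
```

Two properties are worth noting. When the SDF increases between two samples (the ray is moving away from the surface), the `max(0, ...)` clamp sets the opacity to zero. And a ray whose prior has uncertainty `u = 1` contributes nothing to `normal_prior_loss`, which is how unreliable normal supervision is kept from misleading the reconstruction.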