Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen
2024-06-03
Abstract: We aim to tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360 scene reconstruction. Qualitatively, our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the reconstruction of a full 360-degree 3D scene from a sparse set of input views. When the camera rotates 360 degrees around a point, no visual information is available beyond a few frontal views of the central object(s), which makes the problem ill-posed and underconstrained. The paper proposes SparseSplat360 (Sp2360), which uses pretrained 2D diffusion models with low-cost fine-tuning to substantially improve reconstruction quality.

### Main Contributions

1. **A New Method for Sparse 3D Reconstruction**: A systematic method reconstructs sparse-view 360-degree scenes by adding generated novel views to the training set in an autoregressive manner.
2. **Two-Step Generation of New Training Views**: New training views are produced in two steps, image completion (in-painting) followed by denoising (artifact removal), each carried out by a 2D diffusion model. This avoids the need for fine-tuning on large-scale 3D data.
3. **Sparse 3DGS Baseline Method**: A sparse-view 3DGS baseline applies regularization techniques to improve reconstruction from sparse observations without requiring a pretrained model.
4. **Superior Performance**: The experiments show that Sp2360 outperforms existing methods in reconstructing large-scale 3D scenes from only 9 input views.

### Method Overview

1. **Initial 3D Gaussian Representation Optimization**: The sparse 3DGS baseline is fitted to the initial sparse input images, yielding a 3D Gaussian representation that serves as the initial prior.
2. **Autoregressive Generation of New Views**: New views are generated iteratively through the following steps:
   1. Sample novel camera poses and render the corresponding views, which contain artifacts and missing regions.
   2. Use the image completion module to fill in the missing regions.
   3. Use the denoising module to remove artifacts.
   4. Add the generated views to the training set and continue optimizing the 3D representation.

### Key Technologies

1. **Image Completion Module**: The pretrained Stable Diffusion 2 model is used for image completion. Missing regions are indicated by binary masks, and the model is adapted to the current scene through fine-tuning.
2. **Denoising Module**: A conditional diffusion model learns to detect and remove artifacts typical of renderings from a 3D Gaussian representation. It is fine-tuned on a synthetic dataset.

### Experimental Results

- **Quantitative Comparison**: On the Mip-NeRF360 dataset, Sp2360 outperforms all baseline methods on all metrics for 9-view reconstruction and is second only to DiffusioNeRF in the 3-view and 6-view settings.
- **Ablation Study**: Ablations over the individual components confirm that both the image completion and the denoising modules contribute to final reconstruction quality; the combination performs best for both the generated novel views and the final reconstruction.

### Conclusion

Sp2360 addresses sparse-view reconstruction of 360-degree scenes by combining the strong prior of 2D diffusion models with autoregressive generation of new views, and it demonstrates high-quality 3D reconstructions from a small number of input views.
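
The autoregressive loop described in the Method Overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`generate_pseudo_views`, `missing_mask`), the toy image type, and the choice to refit after every single generated view are all assumptions made for illustration. The four callables stand in for the sparse 3DGS fit, the renderer, the in-painting model, and the artifact removal model.

```python
from typing import Callable, List, Optional, Tuple

Pixel = Optional[float]      # None marks a pixel that no Gaussian covers (assumed encoding)
Image = List[List[Pixel]]    # toy grayscale image
Mask = List[List[bool]]      # True = missing region to in-paint


def missing_mask(image: Image) -> Mask:
    """Binary mask of the pixels the render left empty."""
    return [[px is None for px in row] for row in image]


def generate_pseudo_views(
    train_views: List[Image],
    novel_cameras: List[str],
    fit: Callable[[List[Image]], object],
    render: Callable[[object, str], Image],
    inpaint: Callable[[Image, Mask], Image],
    remove_artifacts: Callable[[Image], Image],
) -> Tuple[object, List[Image]]:
    """Grow the training set with cleaned pseudo views, refitting after each one."""
    views = list(train_views)
    scene = fit(views)                        # initial fit on the sparse inputs
    for camera in novel_cameras:
        rendered = render(scene, camera)      # contains holes and artifacts
        filled = inpaint(rendered, missing_mask(rendered))
        views.append(remove_artifacts(filled))
        scene = fit(views)                    # continue optimizing with the new view
    return scene, views


# Toy stand-ins so the sketch runs end to end; none of them models a diffusion prior.
def toy_fit(views: List[Image]) -> int:
    return len(views)                         # the "scene" is just a view count


def toy_render(scene: object, camera: str) -> Image:
    return [[0.5, None], [None, 1.5]]         # two holes, one out-of-range pixel


def toy_inpaint(image: Image, mask: Mask) -> Image:
    return [[0.0 if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def toy_clean(image: Image) -> Image:
    return [[min(max(px, 0.0), 1.0) for px in row] for row in image]


scene, views = generate_pseudo_views(
    [[[1.0, 1.0], [1.0, 1.0]]], ["cam_a", "cam_b"],
    toy_fit, toy_render, toy_inpaint, toy_clean,
)
```

Passing the four stages in as callables keeps the loop independent of any particular model, so a real in-painting or artifact removal model could be swapped in without changing the control flow.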