ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan
2024-08-30
Abstract: Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper "ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model" addresses the problem of reconstructing high-quality 3D scenes from sparse views (e.g., only 2 images). Existing 3D reconstruction techniques work well in dense-view scenarios, but reconstructing a detailed scene from too few captured views remains an ill-posed optimization problem that often produces artifacts and distortions in unseen areas.

### Main Contributions

1. **Proposing the ReconX Framework**: Reframes the ambiguous sparse-view reconstruction challenge as a temporal generation task, using the generative prior of a large pre-trained video diffusion model to synthesize additional observations and thereby improve the quality of 3D reconstruction.
2. **3D Structure Guidance**: Integrates 3D structural information into the conditional space of the video diffusion model so that the generated frames are 3D consistent, and proposes a confidence-aware 3D optimization scheme for reconstructing the scene from the generated video.
3. **Experimental Validation**: Extensive experiments on various real-world datasets show that ReconX outperforms existing methods in both reconstruction quality and generalizability.

### Method Overview

1. **Constructing 3D Structure Guidance**: Uses DUSt3R, an unconstrained stereo 3D reconstruction method, to build a global point cloud from the sparse input views, then encodes the point cloud into a 3D contextual representation space that serves as the structure condition.
2. **3D Consistent Video Frame Generation**: Injects the 3D structure condition into the video diffusion process to generate 3D consistent video frames, which act as additional observations of the scene.
3. **Confidence-Aware 3DGS Optimization**: Recovers the 3D scene from the generated video frames and their confidence maps through a 3D Gaussian Splatting (3DGS) optimization scheme, reducing the influence of uncertain regions in the generated frames.
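The idea behind confidence-aware optimization in step 3 can be illustrated with a small sketch. This is not the paper's loss: the summary does not give the loss terms or weights, so the example assumes a plain per-pixel L1 photometric error, a confidence map with values in [0, 1], and normalization by the total confidence. The function name `confidence_weighted_l1` and all values are hypothetical.

```python
from typing import Sequence


def confidence_weighted_l1(
    rendered: Sequence[float],
    generated: Sequence[float],
    confidence: Sequence[float],
) -> float:
    """Confidence-weighted mean absolute error between two flattened images.

    Each sequence holds one value per pixel. `confidence` is assumed to lie
    in [0, 1]; a pixel with confidence 0 contributes nothing to the loss.
    Illustrative only: the actual ReconX objective is not specified here.
    """
    if not (len(rendered) == len(generated) == len(confidence)):
        raise ValueError("all inputs must have the same number of pixels")
    total_weight = sum(confidence)
    if total_weight == 0:
        return 0.0
    weighted_error = sum(
        c * abs(r - g) for r, g, c in zip(rendered, generated, confidence)
    )
    # Normalizing by the total confidence is an assumption of this sketch.
    return weighted_error / total_weight


# Toy data: the third pixel of the generated frame disagrees strongly with
# the rendering, but its confidence is low, so it is largely ignored.
rendered = [0.5, 0.2, 0.9, 0.4]
generated = [0.5, 0.3, 0.1, 0.4]
confidence = [1.0, 1.0, 0.1, 1.0]

plain_l1 = confidence_weighted_l1(rendered, generated, [1.0] * 4)
weighted_l1 = confidence_weighted_l1(rendered, generated, confidence)
print(f"plain L1: {plain_l1:.4f}, confidence-weighted L1: {weighted_l1:.4f}")
```

In this toy case the weighted loss is much smaller than the unweighted one, because the only large error sits on a low-confidence pixel. That is the intended effect: unreliable regions of a generated frame pull less on the optimized scene.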
### Experimental Results

- **Quantitative Comparison**: On the RealEstate10K and ACID datasets, ReconX outperforms baseline methods on PSNR, SSIM, and LPIPS.
- **Qualitative Comparison**: Under varying viewpoint changes and in cross-dataset generalization settings, the images rendered by ReconX show better visual quality and accuracy than those of the baselines.

### Conclusion

ReconX reconstructs high-quality 3D scenes from sparse views by recasting the reconstruction problem as a generation problem and leveraging a large pre-trained video diffusion model. The summary presents the approach as technically novel and as having broad potential in practical applications.