Abstract: Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model" addresses the problem of reconstructing high-quality 3D scenes from sparse views (e.g., only two images). Although existing 3D reconstruction techniques have achieved significant success in dense-view scenarios, reconstructing a detailed scene from insufficient views remains an ill-posed optimization problem, often leading to artifacts and distortions in unseen areas.

### Main Contributions

1. **Proposing the ReconX Framework**: Reframes the ambiguous reconstruction challenge as a temporal generation task, using large pre-trained video diffusion models to generate additional observations and thereby improve the quality of 3D reconstruction.
2. **3D Structure Guidance**: Integrates 3D structural information into the conditional space of the video diffusion model to generate 3D-consistent frames, and proposes a confidence-aware 3D Gaussian Splatting optimization scheme for reconstructing the scene from the generated video.
3. **Experimental Validation**: Extensive experiments on various real-world datasets show that ReconX outperforms existing methods in quality and generalizability.

### Method Overview

1. **Constructing 3D Structure Guidance**: Uses the unconstrained stereo 3D reconstruction method DUSt3R to construct a global point cloud from the sparse views and encodes it into a 3D context representation space as structural guidance.
2. **3D Consistent Video Frame Generation**: Injects the 3D structure guidance into the video diffusion process to generate 3D-consistent video frames, increasing the number of available observations.
3. **Confidence-Aware 3DGS Optimization**: Uses the generated video frames and their confidence maps to reconstruct the 3D scene through a 3D Gaussian Splatting (3DGS) optimization scheme, further reducing uncertainty.
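
The confidence-aware step in the method overview can be illustrated with a small sketch. It assumes the simplest possible form of the idea: a per-pixel L1 photometric error between a rendered view and a generated video frame, down-weighted where the confidence map is low. The function name `confidence_weighted_l1`, the array shapes, and the normalization are assumptions made for illustration; the paper's actual loss may combine other terms and is not reproduced here.

```python
import numpy as np


def confidence_weighted_l1(rendered, generated, confidence, eps=1e-8):
    """Confidence-weighted mean absolute error between two images.

    rendered, generated: (H, W, 3) float arrays (hypothetical layout).
    confidence: (H, W) array of non-negative per-pixel weights.
    This is an illustrative stand-in, not the loss used in ReconX.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    confidence = np.asarray(confidence, dtype=np.float64)

    if rendered.shape != generated.shape:
        raise ValueError("rendered and generated must have the same shape")
    if confidence.shape != rendered.shape[:2]:
        raise ValueError("confidence must have shape (H, W)")

    # Average the absolute error over the color channels: shape (H, W).
    per_pixel_error = np.abs(rendered - generated).mean(axis=-1)

    # Pixels with low confidence contribute less to the final loss.
    weighted_sum = (confidence * per_pixel_error).sum()
    return float(weighted_sum / (confidence.sum() + eps))


if __name__ == "__main__":
    rendered = np.zeros((2, 2, 3))
    generated = np.zeros((2, 2, 3))
    generated[0, 0] = 1.0  # one unreliable pixel in the generated frame

    uniform = np.ones((2, 2))
    masked = uniform.copy()
    masked[0, 0] = 0.0  # zero confidence at the unreliable pixel

    print(confidence_weighted_l1(rendered, generated, uniform))
    print(confidence_weighted_l1(rendered, generated, masked))
```

With uniform confidence the single bad pixel raises the loss; setting its confidence to zero removes its influence, which is the behavior a confidence map is meant to provide when generated frames contain unreliable regions.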
### Experimental Results

- **Quantitative Comparison**: On the RealEstate10K and ACID datasets, ReconX outperforms baseline methods in metrics such as PSNR, SSIM, and LPIPS.
- **Qualitative Comparison**: Under different viewpoint changes and cross-dataset generalization settings, the images generated by ReconX exhibit superior visual quality and accuracy.

### Conclusion

ReconX successfully generates high-quality 3D scenes from sparse views by transforming the 3D reconstruction problem into a generation problem and leveraging large-scale pre-trained video diffusion models. This approach is not only technically innovative but also has broad potential in practical applications.

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

ReconFusion: 3D Reconstruction with Diffusion Priors

VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos

VI3DRM: Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

NeuralRecon: Real-Time Coherent 3D Scene Reconstruction from Monocular Video

Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

PVP-Recon: Progressive View Planning Via Warping Consistency for Sparse-View Surface Reconstruction

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Data-Driven 3D Reconstruction of Dressed Humans From Sparse Views

DRSM: efficient neural 4d decomposition for dynamic reconstruction in stationary monocular cameras

SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving