Abstract: This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.

What problem does this paper attempt to address?

This paper addresses the problem of efficiently generating high-quality, wide-scope 3D scenes from a single image. Existing methods have the following limitations:

1. **Requirement for multi-view data**: Many methods need multi-view images for training.
2. **Time-consuming per-scene optimization**: Traditional methods usually require a slow optimization run for each scene.
3. **Low background visual quality**: The reconstruction quality in background regions is low.
4. **Distortion in unseen areas**: Regions not visible in the input image may be reconstructed with distortions.

To address these problems, the authors propose a new pipeline, "Wonderland", which uses the latent space of a video diffusion model to predict 3D Gaussian Splatting (3DGS) representations, thereby reconstructing 3D scenes in a feed-forward manner. This approach reduces the need for multi-view data and significantly improves reconstruction efficiency and quality.

### Specific improvements

1. **Introduction of the video diffusion model**:
   - The video diffusion model generates videos that follow a specified camera trajectory, producing compressed video latents that contain multi-view information and maintain 3D consistency.
2. **Dual-branch camera-conditioning mechanism**:
   - A dual-branch camera-conditioning mechanism lets the video diffusion model follow the specified camera motion precisely, so that a single image can be expanded into a multi-view-consistent capture of the 3D scene.
3. **Latent-space-based large-scale reconstruction model (LaLRM)**:
   - LaLRM converts video latents directly into 3D Gaussians, which greatly accelerates reconstruction and significantly reduces memory requirements. Compared with reconstructing scenes from images, the video latent space provides 256-fold spatio-temporal compression while retaining important 3D structural details.
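To make the 256-fold figure concrete, the following minimal sketch computes how many spatio-temporal positions a reconstruction model would process in pixel space versus latent space. The split into a 4x temporal and 8x8 spatial factor is an assumption chosen because it multiplies to 256; the summary above states only the overall ratio. The function name `latent_shape` and the example clip size are hypothetical, and the sketch counts positions only, ignoring channel dimensions.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Return (latent_frames, latent_height, latent_width).

    Assumes (hypothetically) that the video autoencoder downsamples time by
    t_factor and each spatial axis by s_factor, and that the input sizes
    divide evenly by those factors.
    """
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("input sizes must divide evenly by the factors")
    return frames // t_factor, height // s_factor, width // s_factor


def position_count(shape):
    """Number of spatio-temporal positions in a (T, H, W) grid."""
    t, h, w = shape
    return t * h * w


# Hypothetical clip: 48 frames at 480x720 pixels.
pixel_grid = (48, 480, 720)
latent_grid = latent_shape(*pixel_grid)

ratio = position_count(pixel_grid) // position_count(latent_grid)
print(latent_grid, ratio)
```

Under these assumed factors, the reconstruction model attends over a grid that is 256 times smaller than the pixel grid, which is consistent with the claimed savings in time and memory.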
### Experimental verification

Extensive evaluation on multiple benchmark datasets (such as RealEstate10K, DL3DV, and Tanks-and-Temples) shows that the method achieves state-of-the-art performance in single-view 3D scene generation, especially in zero-shot novel view synthesis.

In summary, the main contribution of this paper is a novel method that efficiently generates high-quality, wide-scope 3D scenes from a single image and addresses several limitations of existing methods.

Wonderland: Navigating 3D Scenes from a Single Image

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

World-consistent Video Diffusion with Explicit 3D Modeling

V3D: Video Diffusion Models are Effective 3D Generators

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

WonderWorld: Interactive 3D Scene Generation from a Single Image

Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Wonder3D: Single Image to 3D Using Cross-Domain Diffusion

Envision3D: One Image to 3D with Anchor Views Interpolation

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Generating 3D-Consistent Videos from Unposed Internet Photos

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses