Wonderland: Navigating 3D Scenes from a Single Image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
2024-12-17
Abstract: This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of efficiently generating high-quality, large-scale 3D scenes from a single image. Existing methods have the following limitations:

1. **Requirement for multi-view data**: Many methods need multi-view images for training.
2. **Time-consuming per-scene optimization**: Traditional methods usually require a slow optimization pass for each scene.
3. **Low background visual quality**: Reconstruction quality in background regions is low.
4. **Distortion in unseen areas**: Regions not visible in the input image may be reconstructed with distortions.

To overcome these limitations, the authors propose a new pipeline, "Wonderland", in which a reconstruction model takes latents from a video diffusion model and predicts a 3D Gaussian Splatting representation of the scene in a feed-forward manner. This reduces the need for multi-view data and significantly improves reconstruction efficiency and quality.

### Specific improvements

1. **Introduction of the video diffusion model**: The video diffusion model generates videos that follow a specified camera trajectory, producing compressed video latents that contain multi-view information and maintain 3D consistency.
2. **Dual-branch camera-conditioning mechanism**: A dual-branch camera-conditioning mechanism lets the video diffusion model follow the specified camera movement precisely, extending a single image into a multi-view-consistent capture of the 3D scene.
3. **Latent-space-based large-scale reconstruction model (LaLRM)**: LaLRM converts video latents directly into 3D Gaussian Splatting representations, which greatly accelerates reconstruction and significantly reduces memory requirements. Compared with reconstructing scenes from images, the video latent space provides 256-fold spatio-temporal compression while retaining important 3D structural details.
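
The 256-fold spatio-temporal compression mentioned above can be illustrated with a little arithmetic. The sketch below is only an illustration: the text does not say how the factor of 256 splits between time and space, so the 4x temporal and 8x per-axis spatial factors, the `VideoShape` class, the helper names, and the example video size are all assumptions introduced here. Channel counts are ignored; only the number of spatio-temporal positions is compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Spatio-temporal grid size (hypothetical helper, not from the paper)."""
    frames: int
    height: int
    width: int

    def positions(self) -> int:
        # Number of spatio-temporal positions, ignoring channels.
        return self.frames * self.height * self.width


def latent_shape(video: VideoShape, t_factor: int = 4, s_factor: int = 8) -> VideoShape:
    """Grid size after compression.

    Assumed factors: 4x in time and 8x along each spatial axis,
    chosen only because 4 * 8 * 8 = 256.
    """
    if video.frames % t_factor or video.height % s_factor or video.width % s_factor:
        raise ValueError("video dimensions must be divisible by the compression factors")
    return VideoShape(
        video.frames // t_factor,
        video.height // s_factor,
        video.width // s_factor,
    )


def compression_ratio(video: VideoShape, latent: VideoShape) -> float:
    """Ratio of pixel-grid positions to latent-grid positions."""
    return video.positions() / latent.positions()


# Illustrative video size (an assumption, not a value from the text).
video = VideoShape(frames=48, height=480, width=720)
latent = latent_shape(video)
ratio = compression_ratio(video, latent)
print(latent, ratio)
```

Under these assumed factors, a reconstruction model that reads latents processes 256 times fewer positions than one that reads every frame at full resolution, which is consistent with the speed and memory benefits the summary attributes to LaLRM.
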
### Experimental verification

Extensive evaluation on multiple benchmark datasets (such as RealEstate10K, DL3DV, and Tanks-and-Temples) shows that the method achieves state-of-the-art performance in single-view 3D scene generation, especially on the zero-shot novel view synthesis task.

In summary, the main contribution of this paper is a novel method that efficiently generates high-quality, large-scale 3D scenes from a single image and addresses several limitations of existing methods.