Abstract: We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline requires neither object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes: MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.

What problem does this paper attempt to address?

The paper aims to address the following main issues:

### Core Issues

- **Efficient Generation of 3D Scenes**: Design a method that can quickly generate realistic 3D scenes using only 2D image data for training.

### Specific Challenges and Goals

1. **Lack of Large-Scale 3D Scene Datasets**: Existing 3D datasets are either too small to train generative models or contain isolated objects rather than complete scenes.
2. **Learning 3D Generative Models Directly from Multi-View Images**: Use existing large-scale multi-view image datasets to learn 3D generative models without relying on 3D datasets.
3. **Efficient Sampling Process**: Current 3D generative models are very slow during sampling because they require expensive volumetric rendering operations.
4. **Handling Incomplete Observations of Scenes**: Plausibly infer the content of scene regions that are not observed in multiple images.

### Solution Overview

- **Propose a method based on latent space diffusion models** that can generate high-quality 3D scenes within seconds.
- **Use Gaussian Splats as the 3D representation**, combined with an autoencoder to construct a compressed latent space representation.
- **Train a denoising diffusion model in the latent space** to achieve efficient and high-quality 3D scene generation.
- **Support multiple tasks**, including unconditional generation, single-image 3D reconstruction, and sparse-view 3D reconstruction.

### Main Contributions

1. **First to achieve a generative model for real-world scene distributions based on Gaussian Splats**.
2. **Design a new 3D-aware autoencoder architecture** that can compress multi-view images into a low-dimensional latent space and decode them into Gaussian Splats.
3. **Demonstrate how to efficiently sample diverse and realistic 3D scenes through a diffusion model in the latent space**, whether for unconditional generation or conditioned on input images.
4. **Show that, for a given computational budget**, the proposed latent space method significantly improves the quality of results in both unconditional generation and generative reconstruction.
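
To make the two-stage idea in the solution overview concrete (sample in a compressed latent space, then decode to Gaussian splats), here is a minimal NumPy sketch. It is not the paper's implementation. Every name, tensor size, and the 14-value splat layout are hypothetical. The cosine schedule and the deterministic DDIM-style update are generic stand-ins. The "denoiser" is a closed-form placeholder that is only correct if clean latents were distributed as N(0, data_std² I), and the decoder is a random linear map rather than a trained network.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
NUM_VIEWS, LATENT_H, LATENT_W, LATENT_C = 4, 8, 8, 16
PARAMS_PER_SPLAT = 14  # assumed layout: 3 position, 3 scale, 4 rotation, 1 opacity, 3 color


def cosine_alpha_bar(num_steps):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (1 = clean, ~0 = noise)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    return np.cos(0.5 * np.pi * t) ** 2


def toy_denoiser(z_t, alpha_bar_t, data_std):
    """Placeholder for a trained multi-view denoiser; returns an estimate of the clean latent.

    Closed-form posterior mean E[x0 | z_t] under the assumption x0 ~ N(0, data_std^2 I).
    """
    signal_var = alpha_bar_t * data_std**2
    return np.sqrt(alpha_bar_t) * data_std**2 / (signal_var + 1.0 - alpha_bar_t) * z_t


def sample_latents(num_steps=20, data_std=0.5, seed=0):
    """Deterministic DDIM-style sampling of multi-view latents, starting from pure noise."""
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(num_steps)
    z = rng.normal(size=(NUM_VIEWS, LATENT_H, LATENT_W, LATENT_C))
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        x0 = toy_denoiser(z, ab_t, data_std)
        # Noise implied by the current latent and the clean-latent estimate.
        eps = (z - np.sqrt(ab_t) * x0) / np.sqrt(1.0 - ab_t)
        # Move to the previous (less noisy) level, reusing the same noise direction.
        z = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return z


def toy_decode_to_splats(z, rng):
    """Placeholder decoder: one splat per latent pixel per view via a random linear map."""
    weights = rng.normal(scale=0.1, size=(z.shape[-1], PARAMS_PER_SPLAT))
    raw = z.reshape(-1, z.shape[-1]) @ weights
    # Bias the quaternion toward the identity rotation so its norm stays away from zero.
    quat = raw[:, 6:10] + np.array([1.0, 0.0, 0.0, 0.0])
    return {
        "position": raw[:, 0:3],
        "scale": np.exp(raw[:, 3:6]),  # strictly positive
        "rotation": quat / np.linalg.norm(quat, axis=1, keepdims=True),  # unit quaternion
        "opacity": 1.0 / (1.0 + np.exp(-raw[:, 10:11])),  # in (0, 1)
        "color": 1.0 / (1.0 + np.exp(-raw[:, 11:14])),  # RGB in (0, 1)
    }


if __name__ == "__main__":
    latents = sample_latents()
    splats = toy_decode_to_splats(latents, np.random.default_rng(1))
    print("number of splats:", splats["position"].shape[0])
```

The loop runs only on small latent tensors and the decoder is applied once at the end, which illustrates why sampling in a latent space can avoid rendering at every denoising step. A real system would replace both placeholders with trained networks and render the resulting splats with a Gaussian splatting rasterizer.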

Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models
