Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models

Paul Henderson, Melonie de Almeida, Daniela Ivanova, Titas Anciukevičius
2024-06-19
Abstract: We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats, and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline does not require object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes -- MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper aims to address the following main issues:

### Core Issues

- **Efficient Generation of 3D Scenes**: Design a method that can quickly generate realistic 3D scenes using only 2D image data for training.

### Specific Challenges and Goals

1. **Lack of Large-Scale 3D Scene Datasets**: Existing 3D datasets are either too small to train generative models or contain isolated objects rather than complete scenes.
2. **Learning 3D Generative Models Directly from Multi-View Images**: Use existing large-scale multi-view image datasets to learn 3D generative models without relying on 3D datasets.
3. **Efficient Sampling Process**: Current 3D generative models are very slow during sampling because they require expensive volumetric rendering operations.
4. **Handling Incomplete Observations of Scenes**: Plausibly infer the parts of a scene that the input images do not observe.

### Solution Overview

- **Propose a method based on latent space diffusion models** that can generate high-quality 3D scenes within seconds.
- **Use Gaussian Splats as the 3D representation**, combined with an autoencoder to construct a compressed latent space representation.
- **Train a denoising diffusion model in the latent space** to achieve efficient and high-quality 3D scene generation.
- **Support multiple tasks**, including unconditional generation, single-image 3D reconstruction, and sparse-view 3D reconstruction.

### Main Contributions

1. **First to achieve a generative model for real-world scene distributions based on Gaussian Splats**.
2. **Design a new 3D-aware autoencoder architecture** that can compress multi-view images into a low-dimensional latent space and decode them into Gaussian Splats.
3. **Demonstrate how to efficiently sample diverse and realistic 3D scenes through a diffusion model in the latent space**, whether for unconditional generation or conditioned on input images.
4. **Show that, for a given computational budget**, the proposed latent space method significantly improves the quality of results in both unconditional generation and generative reconstruction.
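The two-stage idea summarized above (sample in a compressed latent space, then decode the latents to Gaussian splats) can be sketched with a toy example. This is only a minimal illustration, not the authors' implementation: the linear noise schedule, the DDPM-style ancestral sampler, the latent shape, the 14-number splat layout, and the names `toy_denoiser` and `toy_decoder` are all assumptions made here for illustration. The stand-in denoiser and decoder are untrained placeholders, so the output is structurally valid but not a meaningful scene.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_denoiser(z_t, t):
    """Hypothetical stand-in for a trained multi-view denoiser.
    A real model would predict the noise present in z_t; this returns zeros."""
    return np.zeros_like(z_t)

def sample_latents(shape, num_steps=50, denoiser=toy_denoiser, seed=0):
    """DDPM-style ancestral sampling, run entirely in latent space."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = denoiser(z, t)
        # Posterior mean of z_{t-1} given the predicted noise.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        z = mean + np.sqrt(betas[t]) * noise
    return z

def toy_decoder(z, seed=1):
    """Hypothetical stand-in for the decoder: one Gaussian per latent vector,
    via a fixed random linear map to 14 numbers (assumed layout:
    3 mean, 3 log-scale, 4 quaternion, 1 opacity logit, 3 color logits)."""
    rng = np.random.default_rng(seed)
    channels = z.shape[-1]
    weights = rng.standard_normal((channels, 14)) / np.sqrt(channels)
    raw = z.reshape(-1, channels) @ weights
    quats = raw[:, 6:10]
    return {
        "means": raw[:, 0:3],
        "scales": np.exp(raw[:, 3:6]),  # strictly positive
        "rotations": quats / (np.linalg.norm(quats, axis=1, keepdims=True) + 1e-8),
        "opacities": 1.0 / (1.0 + np.exp(-raw[:, 10])),  # in (0, 1)
        "colors": 1.0 / (1.0 + np.exp(-raw[:, 11:14])),  # in (0, 1)
    }

# Assumed latent layout: 4 views, 8x8 spatial grid, 16 channels per cell.
latents = sample_latents(shape=(4, 8, 8, 16))
splats = toy_decoder(latents)
print(splats["means"].shape)  # (256, 3): one Gaussian per latent cell
```

In this sketch the sampling loop only ever touches the small latent array, and the decoder runs once at the end; that is the structural reason a latent approach can avoid rendering 3D content at every denoising step.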