Abstract: We present DiffuScene for indoor 3D scene synthesis based on a novel scene configuration denoising diffusion model. It generates 3D instance properties stored in an unordered object set and retrieves the most similar geometry for each object configuration, which is characterized as a concatenation of different attributes, including location, size, orientation, semantics, and geometry features. We introduce a diffusion network to synthesize a collection of 3D indoor objects by denoising a set of unordered object attributes. Unordered parametrization simplifies and eases the joint distribution approximation. The shape feature diffusion facilitates natural object placements, including symmetries. Our method enables many downstream applications, including scene completion, scene arrangement, and text-conditioned scene synthesis. Experiments on the 3D-FRONT dataset show that our method can synthesize more physically plausible and diverse indoor scenes than state-of-the-art methods. Extensive ablation studies verify the effectiveness of our design choice in scene diffusion models.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: how to generate realistic, semantically meaningful and diverse 3D indoor scenes. Specifically, the authors propose a denoising diffusion model named DiffuScene, aiming to achieve this goal by learning the distribution of 3D indoor scenes. These scenes include the semantic categories of objects, surface geometries and placement positions.

### Main problems and solutions

1. **Generating realistic 3D indoor scenes**:
   - Traditional methods usually regard this problem as a data-driven optimization task, which requires prior knowledge to drive scene optimization. However, defining precise rules is time-consuming and requires a great deal of artistic expertise, and the optimization process is often cumbersome and computationally inefficient.
   - DiffuScene avoids human-defined constraints and iterative optimization processes by introducing a diffusion model, and can more naturally generate complex scene configuration patterns.
2. **Improving the diversity and rationality of scene composition**:
   - Existing generative models such as GAN and VAE have limitations in diversity or fidelity. DiffuScene enhances the relationships between objects and the rationality of scene combinations by denoising multiple object properties simultaneously.
3. **Supporting multiple downstream applications**:
   - DiffuScene can not only generate unconditional scenes, but also be used for tasks such as partial-scene completion, scene rearrangement, and text-prompt-based scene synthesis.

### Specific implementation methods

- **Scene representation**: Each scene is represented as an unordered set of objects, and each object is composed of its position, size, orientation, class label and shape code.
- **Diffusion process**: By gradually adding Gaussian noise to object properties, a clean scene is gradually transformed into a noisy scene; in the reverse process, a denoising network is used to gradually remove the noise and restore the original scene.
- **Denoising network**: Based on 1D convolution and attention mechanisms, it aggregates the features of different objects and captures the global scene context.
- **Loss function**: It includes cross-entropy loss (Lsce) and intersection-over-union loss (Liou) to ensure that the generated scenes are reasonable and no object overlap occurs.

### Experimental results

The paper conducted experiments on the 3D-FRONT dataset, and the results show that DiffuScene outperforms existing methods in multiple evaluation metrics such as FID, KID, and SCA, and can generate more diverse and reasonable 3D indoor scenes.

### Application examples

- **Scene completion**: Generate a complete scene from a given partial scene, with higher diversity and fewer overlap problems.
- **Scene rearrangement**: Predict reasonable placement positions and orientations according to a given set of objects, and generate a more natural scene layout.
- **Text-based scene synthesis**: Generate a complete scene that meets the input requirements according to a text prompt describing part of the scene configuration.

In conclusion, by introducing the denoising diffusion model, DiffuScene successfully solves the problems of diversity and rationality in 3D indoor scene generation and shows broad application potential.
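The scene representation and forward diffusion process summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The attribute dimensions, the cos/sin orientation encoding, the linear noise schedule, and the helper names (`make_object`, `linear_alpha_bar`, `q_sample`, `aabb_iou`) are all assumptions chosen for illustration. The overlap measure also treats boxes as axis-aligned, which is a simplification.

```python
import numpy as np

# Hypothetical attribute sizes (illustrative assumptions, not taken from the paper).
LOC, SIZE, ROT, N_CLASSES, SHAPE = 3, 3, 2, 5, 8
D = LOC + SIZE + ROT + N_CLASSES + SHAPE  # length of one object's attribute vector


def make_object(location, size, angle, class_id, shape_code):
    """Concatenate one object's attributes into a flat vector."""
    one_hot = np.zeros(N_CLASSES)
    one_hot[class_id] = 1.0
    rot = np.array([np.cos(angle), np.sin(angle)])  # assumed orientation encoding
    return np.concatenate([location, size, rot, one_hot, shape_code])


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


def aabb_iou(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned 3D boxes given centers and full extents."""
    min_a, max_a = center_a - size_a / 2, center_a + size_a / 2
    min_b, max_b = center_b - size_b / 2, center_b + size_b / 2
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0, None)
    inter = np.prod(overlap)
    union = np.prod(size_a) + np.prod(size_b) - inter
    return inter / union


rng = np.random.default_rng(0)

# A toy scene: an unordered set of 3 objects stacked into an (N, D) array.
scene = np.stack([
    make_object(np.array([0.0, 0.0, 0.0]), np.array([2.0, 0.5, 1.5]), 0.0, 0,
                rng.standard_normal(SHAPE)),
    make_object(np.array([1.5, 0.0, 0.0]), np.array([0.5, 0.5, 0.5]), np.pi / 2, 1,
                rng.standard_normal(SHAPE)),
    make_object(np.array([-1.5, 0.0, 0.0]), np.array([0.5, 0.5, 0.5]), -np.pi / 2, 1,
                rng.standard_normal(SHAPE)),
])

alpha_bar = linear_alpha_bar(num_steps=1000)
noisy_scene, noise = q_sample(scene, t=500, alpha_bar=alpha_bar, rng=rng)
print(noisy_scene.shape)
```

Noise is added to every attribute of every object at once, so a denoising network trained on `(noisy_scene, t)` pairs sees the whole set jointly. A pairwise overlap term such as `aabb_iou`, computed on the predicted locations and sizes, is one way to penalize intersecting objects.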

DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Mixed Diffusion for 3D Indoor Scene Synthesis

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

Learning 3D Scene Synthesis from Annotated RGB-D Images

Diffusion-based Generation, Optimization, and Planning in 3D Scenes

Denoising Diffusion via Image-Based Rendering

3D Scene Diffusion Guidance using Scene Graphs

EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Move Anything with Layered Scene Diffusion

DIScene: Object Decoupling and Interaction Modeling for Complex Scene Generation

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

LT3SD: Latent Trees for 3D Scene Diffusion

External Knowledge Enhanced 3D Scene Generation from Sketch

DiffSF: Diffusion Models for Scene Flow Estimation

Pyramid Diffusion for Fine 3D Large Scene Generation

Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data

ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing