Abstract: We present DiffuScene for indoor 3D scene synthesis based on a novel scene configuration denoising diffusion model. It generates 3D instance properties stored in an unordered object set and retrieves the most similar geometry for each object configuration, which is characterized as a concatenation of different attributes, including location, size, orientation, semantics, and geometry features. We introduce a diffusion network to synthesize a collection of 3D indoor objects by denoising a set of unordered object attributes. Unordered parametrization simplifies and eases the joint distribution approximation. The shape feature diffusion facilitates natural object placements, including symmetries. Our method enables many downstream applications, including scene completion, scene arrangement, and text-conditioned scene synthesis. Experiments on the 3D-FRONT dataset show that our method can synthesize more physically plausible and diverse indoor scenes than state-of-the-art methods. Extensive ablation studies verify the effectiveness of our design choice in scene diffusion models.
What problem does this paper attempt to address?
This paper addresses how to generate realistic, semantically meaningful, and diverse 3D indoor scenes. The authors propose DiffuScene, a denoising diffusion model that learns the distribution of 3D indoor scenes, where each scene is described by the semantic categories, surface geometries, and placements of its objects.
### Main problems and solutions
1. **Generating realistic 3D indoor scenes**:
- Traditional methods usually treat this problem as a data-driven optimization task in which prior knowledge drives the scene optimization. However, defining precise rules is time-consuming and requires a great deal of artistic expertise, and the optimization process is often cumbersome and computationally inefficient.
- By using a diffusion model, DiffuScene avoids human-defined constraints and iterative optimization, and can generate complex scene configuration patterns more naturally.
2. **Improving the diversity and rationality of scene composition**:
- Existing generative models such as GANs and VAEs have limitations in diversity or fidelity. DiffuScene denoises the properties of all objects simultaneously, which helps it capture relationships between objects and produce more plausible scene compositions.
3. **Supporting multiple downstream applications**:
- DiffuScene supports not only unconditional scene generation but also tasks such as partial-scene completion, scene rearrangement, and text-prompt-based scene synthesis.
### Specific implementation methods
- **Scene representation**: Each scene is represented as an unordered set of objects, and each object is composed of its position, size, orientation, class label and shape code.
- **Diffusion process**: The forward process gradually adds Gaussian noise to the object properties, turning a clean scene into a noisy one; the reverse process uses a denoising network to remove the noise step by step and recover a clean scene.
- **Denoising network**: Based on 1D convolution and attention mechanisms, it aggregates the features of different objects and captures the global scene context.
- **Loss function**: It includes a cross-entropy loss (L_sce) and an intersection-over-union loss (L_iou); the latter penalizes overlapping objects so that the generated scenes remain plausible.
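
The representation, forward noising, and overlap penalty described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attribute dimensions, the (cos, sin) orientation encoding, the one-hot class encoding, the linear noise schedule, and all function names (`make_object`, `q_sample`, `aabb_iou`, `overlap_penalty`) are assumptions chosen for illustration. The overlap term is simplified to axis-aligned boxes and ignores object orientation.

```python
import numpy as np

# Hypothetical attribute layout for one object (all sizes are illustrative assumptions):
# location (3) + size (3) + orientation as (cos, sin) (2) + class one-hot + shape code
NUM_CLASSES = 4
SHAPE_DIM = 8
OBJ_DIM = 3 + 3 + 2 + NUM_CLASSES + SHAPE_DIM


def make_object(location, size, angle, class_id, shape_code):
    """Concatenate one object's attributes into a flat vector."""
    one_hot = np.zeros(NUM_CLASSES)
    one_hot[class_id] = 1.0
    orientation = np.array([np.cos(angle), np.sin(angle)])
    return np.concatenate([location, size, orientation, one_hot, shape_code])


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule in the style of DDPM (values are assumptions)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alphas_cumprod, rng):
    """Closed-form forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps


def aabb_iou(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned 3D boxes; `size` holds full edge lengths."""
    min_a, max_a = center_a - size_a / 2, center_a + size_a / 2
    min_b, max_b = center_b - size_b / 2, center_b + size_b / 2
    # Per-axis overlap length, clipped at zero when the boxes are disjoint.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    union = size_a.prod() + size_b.prod() - inter
    return inter / union


def overlap_penalty(scene):
    """Sum of pairwise box IoUs over all object pairs (orientation ignored)."""
    total = 0.0
    for i in range(len(scene)):
        for j in range(i + 1, len(scene)):
            total += aabb_iou(scene[i, :3], scene[i, 3:6], scene[j, :3], scene[j, 3:6])
    return total


rng = np.random.default_rng(0)
# A toy scene: each row is one object; row order carries no meaning.
scene = np.stack([
    make_object(np.array([0.0, 0.0, 0.0]), np.array([2.0, 1.0, 2.0]), 0.0, 0,
                rng.standard_normal(SHAPE_DIM)),
    make_object(np.array([1.0, 0.0, 0.0]), np.array([2.0, 1.0, 2.0]), np.pi / 2, 1,
                rng.standard_normal(SHAPE_DIM)),
    make_object(np.array([5.0, 0.0, 5.0]), np.array([1.0, 1.0, 1.0]), 0.0, 2,
                rng.standard_normal(SHAPE_DIM)),
])

betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

# Noise every attribute of every object at the final timestep.
noisy_scene, eps = q_sample(scene, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
penalty = overlap_penalty(scene)
```

In this toy scene only the first two boxes intersect, so the penalty reduces to their single pairwise IoU. Because the penalty sums over all unordered pairs, it does not depend on the order in which objects are stored, which matches the unordered-set view of a scene.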
### Experimental results
The paper reports experiments on the 3D-FRONT dataset. The results show that DiffuScene outperforms existing methods on multiple evaluation metrics, such as FID, KID, and SCA, and can generate more diverse and plausible 3D indoor scenes.
### Application examples
- **Scene completion**: Generate a complete scene from a given partial scene, with higher diversity and fewer overlap problems.
- **Scene rearrangement**: Predict reasonable placement positions and orientations according to a given set of objects, and generate a more natural scene layout.
- **Text-based scene synthesis**: Generate a complete scene that meets the input requirements according to a text prompt describing part of the scene configuration.
In conclusion, by introducing a denoising diffusion model, DiffuScene improves the diversity and plausibility of generated 3D indoor scenes and supports a range of downstream applications.