Abstract: Recent advancements in 3D diffusion-based semantic scene generation have gained attention. However, existing methods rely on unconditional generation and require multiple resampling steps when editing scenes, which significantly limits their controllability and flexibility. To this end, we propose SSEditor, a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling. SSEditor employs a two-stage diffusion-based framework: (1) a 3D scene autoencoder is trained to obtain latent triplane features, and (2) a mask-conditional diffusion model is trained for customizable 3D semantic scene generation. In the second stage, we introduce a geometric-semantic fusion module that enhances the model's ability to learn geometric and semantic information. This ensures that objects are generated with correct positions, sizes, and categories. Extensive experiments on SemanticKITTI and CarlaSC demonstrate that SSEditor outperforms previous approaches in terms of controllability and flexibility in target generation, as well as the quality of semantic scene generation and reconstruction. More importantly, experiments on the unseen Occ-3D Waymo dataset show that SSEditor is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the limited controllability and flexibility of existing 3D semantic scene generation methods. Specifically:

1. **Weak controllability**: Existing unconditional generation methods limit users' ability to guide 3D scene creation, and methods that condition generation on an entire scene (such as a real scene) are too rigid.
2. **Multi-step resampling problem**: Editing a specific local area (such as adding or deleting objects) requires masking the non-target areas and redrawing through a multi-step resampling process, which significantly increases generation time and is difficult to control.
3. **Challenges in generating complex outdoor scenes**: Compared with indoor scenes and single objects, outdoor scenes are harder to generate because they are sparse and their representation is complex. For example, voxel-based representations contain a large number of empty voxels, resulting in high computational cost and a lot of redundant information.

To solve these problems, the paper proposes SSEditor, a controllable semantic scene editor based on a diffusion model. SSEditor improves the controllability and flexibility of 3D semantic scene generation in the following ways:

- **Two-stage framework**:
  - **First stage**: Train a 3D scene autoencoder to obtain triplane features, thereby learning the geometric and semantic information of the scene.
  - **Second stage**: Train a mask-conditional diffusion model for customizable 3D semantic scene generation.
- **Geometric-Semantic Fusion Module (GSFM)**: This module contains a geometric branch and a semantic branch, which process the 3D masks and the semantic labels and markers, respectively, ensuring that the generated objects have the correct pose, size, and category.
- **3D mask asset library**: A 3D mask asset library has been created. Users can select or create various assets to customize 3D scenes for tasks such as controllable scene inpainting, novel urban scene generation, and removal of the trailing artifacts left by dynamic objects.

In conclusion, SSEditor aims to provide a flexible and controllable method for 3D semantic scene generation that overcomes the limitations of existing methods in controllability and efficiency, and it performs particularly well on complex outdoor scenes.
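To make two of the points above concrete, here is a minimal NumPy sketch. It is not the SSEditor implementation. It only illustrates (a) how many spatial cells a dense voxel grid has compared with three axis-aligned planes of the same resolution, and (b) how a box-shaped 3D mask asset with a position, size, and category could be rasterized into a grid. The grid size of 256 x 256 x 32 (the layout used by the SemanticKITTI scene completion benchmark), the helper names `cell_counts`, `box_mask`, and `place_asset`, and the class id `1` are all assumptions made for illustration. In the paper the mask conditions a diffusion model; writing the label directly into the grid, as done here, is only a stand-in to show what the condition encodes.

```python
import numpy as np

# Assumed grid size (not stated in the summary above).
GRID_SHAPE = (256, 256, 32)


def cell_counts(shape):
    """Return (dense voxel count, total cells of the XY, XZ and YZ planes)."""
    x, y, z = shape
    return x * y * z, x * y + x * z + y * z


def box_mask(shape, center, size):
    """Rasterize an axis-aligned box into a boolean 3D mask.

    `center` and `size` are in voxel units; the box is clipped to the grid.
    """
    mask = np.zeros(shape, dtype=bool)
    lo = [max(0, c - s // 2) for c, s in zip(center, size)]
    hi = [min(dim, max(0, c - s // 2 + s))
          for c, s, dim in zip(center, size, shape)]
    mask[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = True
    return mask


def place_asset(scene, mask, label):
    """Return a copy of `scene` with `label` written wherever `mask` is set."""
    edited = scene.copy()
    edited[mask] = label
    return edited


if __name__ == "__main__":
    voxels, plane_cells = cell_counts(GRID_SHAPE)
    print("dense voxels:", voxels, "triplane cells:", plane_cells)

    scene = np.zeros(GRID_SHAPE, dtype=np.uint8)  # 0 = empty voxel
    car = box_mask(GRID_SHAPE, center=(128, 100, 4), size=(20, 8, 6))
    edited = place_asset(scene, car, label=1)  # 1 = hypothetical "car" id
    empty_fraction = 1.0 - np.count_nonzero(edited) / edited.size
    print("mask voxels:", int(car.sum()), "empty fraction:", empty_fraction)
```

For the assumed grid, the dense representation has 2,097,152 voxels while the three planes together have 81,920 cells, about 25.6 times fewer. This compares spatial cells only: real triplane latents carry feature channels per cell and may use a different resolution, neither of which is given in the summary.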

SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model
