SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model

Haowen Zheng, Yanyan Liang
2024-11-19
Abstract: Recent advancements in 3D diffusion-based semantic scene generation have gained attention. However, existing methods rely on unconditional generation and require multiple resampling steps when editing scenes, which significantly limits their controllability and flexibility. To this end, we propose SSEditor, a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling. SSEditor employs a two-stage diffusion-based framework: (1) a 3D scene autoencoder is trained to obtain latent triplane features, and (2) a mask-conditional diffusion model is trained for customizable 3D semantic scene generation. In the second stage, we introduce a geometric-semantic fusion module that enhances the model's ability to learn geometric and semantic information. This ensures that objects are generated with correct positions, sizes, and categories. Extensive experiments on SemanticKITTI and CarlaSC demonstrate that SSEditor outperforms previous approaches in terms of controllability and flexibility in target generation, as well as the quality of semantic scene generation and reconstruction. More importantly, experiments on the unseen Occ-3D Waymo dataset show that SSEditor is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limited controllability and flexibility of existing 3D semantic scene generation methods. Specifically:

1. **Weak controllability**: Unconditional generation gives users little ability to guide 3D scene creation, while conditional generation based on an entire scene (for example, a real scene) is too rigid.
2. **Multi-step resampling**: Editing a specific local area (such as adding or deleting objects) requires masking the non-target areas and redrawing the target area through a multi-step resampling process, which significantly increases generation time and is difficult to control.
3. **Challenges in generating complex outdoor scenes**: Compared with indoor scenes and single objects, outdoor scenes are harder to generate because they are sparse and complex to represent. For example, voxel-based representations contain a large number of empty voxels, which leads to high computational cost and a lot of redundant information.

To solve these problems, the paper proposes SSEditor, a controllable semantic scene editor based on a diffusion model. SSEditor improves the controllability and flexibility of 3D semantic scene generation in the following ways:

- **Two-stage framework**:
  - **First stage**: Train a 3D scene autoencoder to obtain triplane features, thereby learning the geometric and semantic information of the scene.
  - **Second stage**: Train a mask-conditional diffusion model for customizable 3D semantic scene generation.
- **Geometric-Semantic Fusion Module (GSFM)**: This module contains a geometric branch and a semantic branch, which process 3D masks and semantic labels and markers respectively, ensuring that the generated objects have the correct pose, size, and category.
- **3D mask asset library**: The authors build a library of 3D mask assets. Users can select or create assets to customize 3D scenes, which supports tasks such as controllable scene inpainting, new urban scene generation, and removal of the trailing artifacts left by dynamic objects.

In conclusion, SSEditor aims to provide a flexible and controllable method for 3D semantic scene generation that overcomes the limitations of existing methods in controllability and efficiency, particularly for complex outdoor scenes.
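To make the sparsity argument and the mask-based conditioning more concrete, here is a minimal NumPy sketch. It is not the paper's implementation. The grid size `GRID`, the helper names `box_mask`, `project_to_planes`, and `cell_counts`, and the box-shaped "car" asset are all assumptions introduced for illustration. The sketch builds a binary 3D mask for one object, projects it onto three axis-aligned planes (the layout a triplane representation uses), and compares the number of spatial cells in a dense voxel grid with the number in three planes.

```python
import numpy as np

# Assumed voxel grid size (X, Y, Z); chosen for illustration only.
GRID = (256, 256, 32)


def box_mask(grid_shape, center, size):
    """Return a boolean 3D mask that is True inside an axis-aligned box.

    The box stands in for a hypothetical mask asset; it is clipped to the grid.
    """
    mask = np.zeros(grid_shape, dtype=bool)
    slices = []
    for g, c, s in zip(grid_shape, center, size):
        lo = c - s // 2
        hi = lo + s
        slices.append(slice(max(0, lo), min(g, hi)))
    mask[tuple(slices)] = True
    return mask


def project_to_planes(mask):
    """Project a 3D mask onto the XY, XZ, and YZ planes.

    A plane cell is True if any voxel along the collapsed axis is True.
    This is one simple way to align a 3D mask with a triplane layout,
    assumed here for illustration.
    """
    return {
        "xy": mask.any(axis=2),
        "xz": mask.any(axis=1),
        "yz": mask.any(axis=0),
    }


def cell_counts(grid_shape):
    """Spatial cell counts for a dense grid and for three planes.

    Feature channels per cell are ignored; only spatial cells are counted.
    """
    x, y, z = grid_shape
    return x * y * z, x * y + x * z + y * z


# A hypothetical car-sized box placed in the scene.
car = box_mask(GRID, center=(128, 100, 8), size=(20, 9, 7))
planes = project_to_planes(car)
dense_cells, triplane_cells = cell_counts(GRID)
empty_fraction = 1.0 - car.sum() / car.size

print(f"occupied voxels: {int(car.sum())}")
print(f"empty fraction of the grid: {empty_fraction:.4f}")
print(f"dense cells: {dense_cells}, triplane cells: {triplane_cells}")
```

With a single object in the grid, almost every voxel is empty, which illustrates the redundancy that the summary attributes to voxel-based representations. The three planes together hold far fewer spatial cells than the dense grid, although a real triplane stores a feature vector per cell, so the actual memory saving depends on the channel count.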