Abstract: We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in a single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room, or avatar cloth re-targeting. Our project page is available at: https://wolfball.github.io/frankenstein/
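
To make the "one tri-plane, multiple SDF fields" idea concrete, here is a minimal NumPy sketch of how a single tri-plane tensor could be queried at a 3D point and decoded into one signed distance per semantic part. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `query_sdfs`), the tensor shapes, the summation of the three plane features, and the tiny randomly initialized MLP decoder are all hypothetical.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at coordinates u, v in [-1, 1]."""
    _, res, _ = plane.shape
    # Map normalized coordinates to continuous pixel indices in [0, res - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    # Clamp the lower corner so that the upper corner stays inside the grid.
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[:, y0, x0] + fx * plane[:, y0, x0 + 1]
    bottom = (1 - fx) * plane[:, y0 + 1, x0] + fx * plane[:, y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom


def query_sdfs(triplane, point, w1, b1, w2, b2):
    """Decode K signed distances for one 3D point from a (3, C, R, R) tri-plane."""
    x, y, z = point
    # Project the point onto the XY, XZ, and YZ planes and aggregate the features.
    # Summation is an assumption; concatenation is another common choice.
    feature = (
        bilinear_sample(triplane[0], x, y)
        + bilinear_sample(triplane[1], x, z)
        + bilinear_sample(triplane[2], y, z)
    )
    hidden = np.maximum(feature @ w1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ w2 + b2  # one SDF value per semantic part


# Hypothetical sizes: 8 feature channels, 16x16 planes, 3 semantic parts.
rng = np.random.default_rng(0)
channels, resolution, num_parts, hidden_dim = 8, 16, 3, 32
triplane = rng.standard_normal((3, channels, resolution, resolution))
w1 = rng.standard_normal((channels, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, num_parts)) * 0.1
b2 = np.zeros(num_parts)

sdf_values = query_sdfs(triplane, (0.25, -0.5, 0.1), w1, b1, w2, b2)
print(sdf_values.shape)
```

Because every part is decoded from the same feature vector, the spatial relationship between parts is stored implicitly in the shared tri-plane rather than in separate per-part representations.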

What problem does this paper attempt to address?

The paper addresses the generation of 3D scenes with semantic compositionality. Existing 3D generative models typically output a single, unified 3D shape in which semantic information is entangled with other attributes and is difficult to separate. As a result, the generated assets cannot be used directly in downstream applications that need separate parts, such as splitting a generated vehicle model into a body and rollable wheels for a video game, or segmenting the body, limbs, hair, and clothing of a 3D human avatar.

To tackle this, the paper proposes Frankenstein, a framework that directly generates 3D scenes containing multiple independent semantic components. Each component has a complete shape and can be edited on its own. Frankenstein encodes the 3D scene information in a single tri-plane tensor and uses a denoising diffusion model to generate these scenes.

The main challenges are:

1. A general 3D representation is needed that can model the complete shapes of multiple semantic components simultaneously.
2. Modeling the relationships between semantic parts is complex: the relative positions of parts must be semantically and physically plausible, for example by avoiding interpenetration.

Frankenstein addresses these challenges in three steps:

1. **Tri-plane fitting**: Training scenes are converted into tri-plane tensors, which implicitly encode the compositional shape information and the spatial relationships between components.
2. **Variational Autoencoder (VAE) training**: The tri-plane is compressed into a more compact latent tri-plane space, which significantly improves computational efficiency.
3. **Conditional denoising**: A diffusion model approximates the distribution of the latent tri-planes, so that sampling from it yields 3D scenes with semantic compositionality.

The paper demonstrates Frankenstein on room interiors and compositional human avatars, reporting good overall quality and diversity in the generated shapes. The generated scenes support downstream applications such as part-wise re-texturing, object rearrangement in a room, and avatar cloth re-targeting.
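
The denoising step in the pipeline above can be illustrated with the standard DDPM forward-noising formula applied to a latent tri-plane. The sketch below is hedged: the summary only says that a denoising diffusion process is used, so the linear noise schedule, the number of steps, the noise-prediction parameterization, and the latent shape are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumed choice, common in DDPM-style models)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(z0, t, alpha_bar, noise):
    """Forward process: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def predict_z0(zt, t, alpha_bar, eps_pred):
    """Invert the forward process given a noise estimate from a denoiser."""
    return (zt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step

# Hypothetical latent tri-plane: 3 planes, 4 channels, 8x8 resolution.
z0 = rng.standard_normal((3, 4, 8, 8))
noise = rng.standard_normal(z0.shape)

t = 500
zt = q_sample(z0, t, alpha_bar, noise)

# A trained network would estimate the noise from (zt, t, condition).
# Here the true noise stands in for a perfect prediction.
z0_recovered = predict_z0(zt, t, alpha_bar, noise)
print(np.allclose(z0_recovered, z0))
```

In a full system the recovered latent would be passed through the VAE decoder to obtain the tri-plane, from which the per-part SDFs are then decoded.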

Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane
