Abstract: Recent advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for Compositional 4D Generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately. Utilizing Large Language Models (LLMs), the framework begins by decomposing an input text prompt into distinct entities and maps out their trajectories. It then constructs the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by the pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate our outstanding 4D content creation capability compared to prior arts, showcasing superior visual quality, motion fidelity, and enhanced object interactions.
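The step of positioning objects along their designated paths can be illustrated with a minimal sketch. It assumes that each entity's trajectory arrives as a short list of 3D waypoints and that positions between waypoints are linearly interpolated. The abstract does not specify either the trajectory format or the interpolation scheme, so `sample_trajectory`, `trajectories`, and the example entities are hypothetical.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def sample_trajectory(waypoints: List[Vec3], t: float) -> Vec3:
    """Return the position at normalized time t in [0, 1] on a piecewise-linear path.

    Assumption: waypoints are evenly spaced in time. This is an illustrative
    choice, not the trajectory model used by Comp4D.
    """
    if not waypoints:
        raise ValueError("need at least one waypoint")
    if len(waypoints) == 1:
        return waypoints[0]  # a static object stays at its single waypoint
    t = min(max(t, 0.0), 1.0)  # clamp time to the valid range
    scaled = t * (len(waypoints) - 1)
    i = min(int(scaled), len(waypoints) - 2)  # index of the segment start
    frac = scaled - i  # fraction of the way through segment i
    a, b = waypoints[i], waypoints[i + 1]
    return tuple(pa + frac * (pb - pa) for pa, pb in zip(a, b))


# Hypothetical per-entity trajectories, one entry per entity in the prompt.
trajectories: Dict[str, List[Vec3]] = {
    "butterfly": [(-1.0, 0.0, 0.5), (0.0, 0.0, 0.8), (0.5, 0.0, 0.6)],
    "flower": [(0.6, 0.0, 0.0)],
}

# Position of every entity at the midpoint of the animation.
positions_mid = {name: sample_trajectory(path, 0.5) for name, path in trajectories.items()}
```

Keeping one trajectory per entity mirrors the compositional idea: each object can be placed, rendered, and refined on its own before the full scene is assembled.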

What problem does this paper attempt to address?

This paper addresses the limitation that existing 4D content generation methods are largely object-centric and cannot generate multiple objects and their interactions in complex scenes. Because 3D scene datasets are scarce, most prior work concentrates on improving score distillation to enhance novel-view rendering. As a result, these methods typically produce a 4D representation of a single object and cannot handle 4D scenes that contain several interacting objects.

To overcome this limitation, the paper proposes the **Comp4D** framework for compositional 4D scene generation. Unlike conventional methods, **Comp4D** decomposes the input text prompt into distinct entities, maps out their trajectories, and then positions the objects along those trajectories to build the scene. It also introduces a compositional score distillation technique guided by the pre-defined trajectories, which uses pre-trained diffusion models across the text-to-image, text-to-video, and text-to-3D domains to improve the visual quality, motion fidelity, and object interactions of the 4D content.

The main contributions of the paper include:

1. **Proposing the Comp4D framework**, which realizes compositional 4D scene creation for the first time. By decomposing the creation of a 4D scene into the construction of individual 4D objects and their interactions, it overcomes the object-centric limitation of existing methods.
2. **Decomposing object motion into two parts: global displacement and local deformation.** Large language models (LLMs) design the global displacement, which reduces the burden on the 4D representation and lets it focus on local deformation.
3. **Adopting a deformable 3D Gaussian model**, which lets the 4D representation switch flexibly between single-object and multi-object rendering and keeps the optimization of object motion stable even in the presence of potential occlusions.
4. **Verifying the approach through extensive experiments.** Compared with existing baseline methods, **Comp4D** performs better in visual quality, motion fidelity, and object interaction.
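The split of object motion into global displacement and local deformation can be sketched as follows. This is a toy illustration under stated assumptions: the two components are combined by simple addition, and both `global_displacement` and `local_deformation` are hypothetical analytic stand-ins for, respectively, an LLM-designed trajectory and a learned deformation of the 3D Gaussians. The summary does not give the actual formulation.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def add(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise sum of two 3D vectors."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def global_displacement(t: float) -> Vec3:
    """Hypothetical stand-in for the trajectory: straight-line motion along x."""
    return (2.0 * t, 0.0, 0.0)


def local_deformation(point: Vec3, t: float) -> Vec3:
    """Hypothetical stand-in for a learned deformation: a small vertical
    oscillation whose amplitude grows with the point's canonical x coordinate."""
    return (0.0, 0.0, 0.1 * math.sin(2.0 * math.pi * t) * point[0])


def place_points(canonical: List[Vec3], t: float) -> List[Vec3]:
    """Assumed composition: world point = canonical + local deformation + global shift."""
    shift = global_displacement(t)
    return [add(add(p, local_deformation(p, t)), shift) for p in canonical]


# Two canonical Gaussian centers of one object, evaluated at t = 0.25.
canonical_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = place_points(canonical_centers, 0.25)
```

Because the global shift is applied after the local deformation, the deformation only has to model motion in the object's own frame, which is the reduced burden that the second contribution refers to.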

Comp4D: LLM-Guided Compositional 4D Scene Generation

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Compositional 3D-aware Video Generation with LLM Director

CT4D: Consistent Text-to-4D Generation with Animatable Meshes

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

4Dynamic: Text-to-4D Generation with Hybrid Priors

EG4D: Explicit Generation of 4D Object without Score Distillation

DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Semantic Score Distillation Sampling for Compositional Text-to-3D Generation

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Animate124: Animating One Image to 4D Dynamic Scene

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models