Comp4D: LLM-Guided Compositional 4D Scene Generation

Dejia Xu, Hanwen Liang, Neel P. Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N. Plataniotis, Zhangyang Wang
2024-03-26
Abstract: Recent advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for Compositional 4D Generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately. Utilizing Large Language Models (LLMs), the framework begins by decomposing an input text prompt into distinct entities and maps out their trajectories. It then constructs the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by the pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate our outstanding 4D content creation capability compared to prior arts, showcasing superior visual quality, motion fidelity, and enhanced object interactions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a gap in existing 4D content generation: current methods are object-centric and cannot generate complex scenes that contain multiple objects and their interactions. Because 3D scene datasets are scarce, most prior work has concentrated on improving score distillation to obtain better novel-view renderings. As a result, these methods typically produce a 4D representation of a single object only.

To overcome this limitation, the paper proposes the **Comp4D** framework for generating compositional 4D scenes. Rather than building one 4D representation of the whole scene, **Comp4D** decomposes the input text prompt into distinct entities, maps out a trajectory for each, and then places each object along its trajectory. **Comp4D** also introduces a compositional score distillation technique guided by the predefined trajectories. It uses pre-trained diffusion models across the text-to-image, text-to-video, and text-to-3D domains to improve the visual quality, motion fidelity, and object interactions of the generated 4D content.

The main contributions of the paper are:

1. **Proposing the Comp4D framework**, which the paper presents as the first to create compositional 4D scenes. By decomposing scene creation into the construction of individual 4D objects and their interactions, it overcomes the object-centric limitation of existing methods.
2. **Decomposing object motion into two parts: global displacement and local deformation.** Large language models (LLMs) design the global displacement, so the 4D representation is relieved of that burden and can focus on local deformation.
3. **Adopting a deformable 3D Gaussian representation**, which lets the 4D representation switch flexibly between single-object and multi-object rendering and allows object motion to be optimized stably even under potential occlusion.
4. **Verifying the approach through extensive experiments.** Compared with existing baseline methods, **Comp4D** performs better in visual quality, motion fidelity, and object interaction.
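
The split of motion into global displacement and local deformation can be illustrated with a small sketch. This is not the paper's implementation: the waypoint lists stand in for trajectories that an LLM would propose, `local_deformation` is a toy sinusoid replacing a learned deformation field, and all function names, object names, and values are hypothetical. Only the Gaussian centers are modeled, and only translation is applied.

```python
import numpy as np


def global_position(waypoints, t):
    """Linearly interpolate (K, 3) trajectory waypoints at normalized time t in [0, 1]."""
    waypoints = np.asarray(waypoints, dtype=float)
    knots = np.linspace(0.0, 1.0, len(waypoints))
    return np.array([np.interp(t, knots, waypoints[:, d]) for d in range(3)])


def local_deformation(centers, t, amplitude=0.05):
    """Toy stand-in for a learned deformation field (assumption, not the paper's model)."""
    offsets = np.zeros_like(centers)
    # Move each point slightly along y, with a phase that depends on its x coordinate.
    offsets[:, 1] = amplitude * np.sin(2.0 * np.pi * t + centers[:, 0])
    return offsets


def place_object(centers, waypoints, t):
    """Single-object view: deform in the object frame, then translate along the trajectory."""
    return centers + local_deformation(centers, t) + global_position(waypoints, t)


def compose_scene(objects, t):
    """Multi-object view: concatenate every placed object into one scene-level point set."""
    placed = [place_object(o["centers"], o["waypoints"], t) for o in objects]
    return np.concatenate(placed, axis=0)


# Hypothetical two-object scene; waypoints play the role of LLM-proposed trajectories.
rng = np.random.default_rng(0)
objects = [
    {
        "name": "butterfly",
        "centers": rng.normal(scale=0.1, size=(4, 3)),
        "waypoints": [[-1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 0.0, 0.0]],
    },
    {
        "name": "flower",
        "centers": rng.normal(scale=0.1, size=(3, 3)),
        "waypoints": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # stays in place
    },
]
scene = compose_scene(objects, t=0.5)
```

In this sketch `place_object` corresponds to rendering one object on its own and `compose_scene` to rendering the full scene, which mirrors the single-object and multi-object switching described in the third contribution. Because the trajectory supplies the large-scale motion, the deformation term only has to account for small offsets around each object's rest shape.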