GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen

2024-04-23

Abstract: Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently in maintaining motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for further explorations.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two problems in text-to-video generation: high computational cost and videos whose motion is not physically coherent. It proposes GPT4Motion, a training-free framework that combines the planning capability of GPT-4, the physics simulation of Blender, and the image generation ability of text-to-image diffusion models such as Stable Diffusion. The workflow has two steps. First, GPT-4 generates a Blender script from the user's text prompt, and the script drives Blender's built-in physics engine to create basic scene components with coherent physical motion across frames. Second, these components are fed into Stable Diffusion to generate a video aligned with the prompt. The resulting video is therefore faithful to the prompt and physically consistent from frame to frame. Experiments on three basic physical motion scenarios (rigid object drop and collision, cloth draping and swinging, and liquid flow) indicate that GPT4Motion generates high-quality videos efficiently while maintaining motion coherence and entity consistency. The authors present the framework as offering new insights for text-to-video research and opening directions for further exploration.
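The workflow summarized above can be pictured as a short chain of stages. The following is a minimal Python sketch of that chain, not the authors' implementation: the names `text_to_video`, `plan_script`, `simulate`, `synthesize`, and `SceneComponent` are hypothetical, and the three stages are passed in as callables because the summary does not specify how GPT-4, Blender, or Stable Diffusion are actually invoked.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SceneComponent:
    """Hypothetical container for one frame of simulated scene output."""
    index: int      # frame number within the simulated sequence
    data: bytes     # placeholder payload; the real format is not given in the summary


def text_to_video(
    prompt: str,
    plan_script: Callable[[str], str],
    simulate: Callable[[str], Sequence[SceneComponent]],
    synthesize: Callable[[str, Sequence[SceneComponent]], List[bytes]],
) -> List[bytes]:
    """Chain the three stages described in the summary.

    plan_script: stands in for GPT-4 turning the prompt into a Blender script.
    simulate:    stands in for Blender running the script with its physics engine.
    synthesize:  stands in for Stable Diffusion turning components into frames.
    All three are assumptions supplied by the caller, not real library APIs.
    """
    script = plan_script(prompt)             # step 1: prompt -> Blender script
    components = simulate(script)            # step 1 (cont.): script -> per-frame components
    frames = synthesize(prompt, components)  # step 2: components + prompt -> video frames

    # One output frame per simulated frame keeps the motion aligned across the video.
    if len(frames) != len(components):
        raise ValueError("synthesize must return one frame per scene component")
    return frames
```

Passing the stages as arguments keeps the sketch runnable with simple stand-in functions and makes the ordering explicit: no model is trained, and the only data flowing between stages is the generated script and the simulated per-frame components.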

Motion Prompting: Controlling Video Generation with Motion Trajectories

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT: Human Motion as a Foreign Language

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

Motion Control for Enhanced Complex Action Video Generation

GPT-Connect: Interaction between Text-Driven Human Motion Generator and 3D Scenes in a Training-free Manner

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

PhysMotion: Physics-Grounded Dynamics From a Single Image

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

LivePhoto: Real Image Animation with Text-guided Motion Control

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

TC4D: Trajectory-Conditioned Text-to-4D Generation

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators