PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
2024-12-01
Abstract: Text-to-video (T2V) generation has been recently enabled by transformer-based diffusion models, but current T2V models lack capabilities in adhering to the real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot be generalizable to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers. The source codes are available at: https://github.com/pittisl/PhyT2V
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the failure of current text-to-video (T2V) generation models to respect real-world common sense and physical rules in the videos they produce. Although existing T2V models can generate complex and realistic scenes, they perform poorly on aspects of physical realism such as quantity, materials, fluid dynamics, gravity, motion, collision, and causality. These shortcomings limit their use in real-world simulation, especially for domains outside the training data. The paper proposes PhyT2V, a technique that improves existing T2V models by embedding real-world knowledge and physical rules in the text prompt, so that the generated videos better conform to physical laws, particularly in out-of-distribution domains. PhyT2V uses a large language model (LLM) with chain-of-thought (CoT) and step-back reasoning to guide video generation step by step and to refine it iteratively.
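
The iterative refinement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `refine_prompt`, the three callables (`llm`, `generate_video`, `describe_video`), the prompt wording, and the fixed number of rounds are all assumptions made for illustration. In particular, feeding a text description of the last video back to the LLM is an assumed feedback mechanism, not something stated in the summary.

```python
from typing import Callable, List


def refine_prompt(
    prompt: str,
    llm: Callable[[str], str],             # hypothetical: text in, text out
    generate_video: Callable[[str], str],  # hypothetical: prompt -> video handle
    describe_video: Callable[[str], str],  # hypothetical: video handle -> text
    rounds: int = 3,
) -> List[str]:
    """Return the prompt used in each round, starting with the original."""
    history = [prompt]
    for _ in range(rounds):
        current = history[-1]

        # Generate a video from the current prompt and summarize what it shows
        # (assumed feedback signal for the next refinement step).
        video = generate_video(current)
        observed = describe_video(video)

        # Chain-of-thought step: ask the LLM to spell out, step by step,
        # the physical rules the scene should obey.
        rules = llm(
            "List, step by step, the physical rules that a video of this "
            f"scene must obey:\n{current}"
        )

        # Step-back step: ask the LLM to reason from those general rules
        # and rewrite the prompt so that it states them explicitly.
        revised = llm(
            "Step back and consider the general principles below, then "
            "rewrite the prompt so the video respects them.\n"
            f"Prompt: {current}\n"
            f"Physical rules: {rules}\n"
            f"What the last video showed: {observed}\n"
            "Rewritten prompt:"
        )
        history.append(revised.strip())
    return history
```

Passing the models in as callables keeps the sketch independent of any particular T2V model or LLM API. A real system would likely stop once the refined prompt no longer changes or the video quality stops improving, rather than after a fixed number of rounds.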