Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
2024-12-04
Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to improve the realism and accuracy of dynamic object interactions in text-to-video generation models by introducing external feedback, in particular feedback from large vision-language models. When generating dynamic scenes, these models often produce unrealistic motion or violate physical laws, so the generated content does not match the prompt and falls short of what practical applications need.

### Main Problems and Solutions

1. **Problem Description**:
   - Current state-of-the-art text-to-video generation models handle dynamic object interactions poorly, producing unrealistic motion and violations of physical laws.
   - These problems impede wider use of the models in practical applications such as creative video content generation, animation production, and film production.
2. **Solution**:
   - Introduce an external feedback mechanism to optimize the generated content. With external feedback, the model can adjust its own output to better match the desired results, without extensive manual data collection.
   - The authors derive a unified probabilistic objective for offline reinforcement learning (RL) finetuning of text-to-video models. Within this framework, design elements of existing algorithms, such as KL regularization and policy projection, appear as specific choices. The derived methods are then used to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), although these metrics often fail to align with human perception of generation quality.
   - To address that limitation, the authors propose using vision-language models to provide more nuanced feedback tailored to object dynamics in videos. Experiments show that binary AI feedback yields the largest improvements in video quality, especially in scenes with complex multi-object interactions and realistic depictions of falling objects.

### Specific Methods

1. **Unified Probabilistic Objective Framework**:
   - Derive a unified probabilistic objective for offline RL finetuning of text-to-video models. The framework shows how design choices in existing algorithms, such as KL regularization and policy projection, can be understood from a single perspective.
2. **Selection of Feedback Types**:
   - Test metric-based feedback covering semantics, human preferences, and motion dynamics.
   - Propose binary feedback from large vision-language models as a more accurate judge of generated video quality.
3. **Experimental Verification**:
   - The experiments show that the proposed framework can effectively optimize a wide variety of rewards, with binary AI feedback performing best at improving video quality for dynamic interactions.
   - Both AI and human evaluations confirm that AI feedback improves video quality, especially for complex multi-object interactions and falling objects.

### Conclusion

This research improves the realism and accuracy of dynamic object interactions in text-to-video generation models by introducing external feedback, especially binary feedback from large vision-language models. It points to a direction for future research on the physical consistency and visual realism of generated content.
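
To make the idea of offline finetuning from binary AI feedback more concrete, the following is a minimal sketch of reward-weighted regression, one common way a KL-regularized objective is turned into a weighted supervised loss. It is not taken from the paper and should not be read as the authors' exact algorithm. The function names, the temperature `beta`, and the example losses and feedback labels are all hypothetical, chosen only for illustration.

```python
import math
from typing import Sequence


def reward_weights(rewards: Sequence[float], beta: float = 1.0) -> list[float]:
    """Self-normalized weights proportional to exp(reward / beta).

    Assumption: `beta` plays the role of a KL-regularization strength.
    A smaller beta concentrates the weight on high-reward samples.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the maximum reward before exponentiating for numerical stability.
    top = max(rewards)
    unnormalized = [math.exp((r - top) / beta) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def reward_weighted_loss(
    per_sample_losses: Sequence[float],
    rewards: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Weighted average of per-sample training losses.

    Assumption: `per_sample_losses` are whatever the generative model
    already minimizes per video (for example, a denoising loss).
    """
    if len(per_sample_losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    weights = reward_weights(rewards, beta)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


# Hypothetical batch of four generated videos.
# feedback: 1 = a vision-language model accepted the video, 0 = rejected.
feedback = [1, 0, 1, 0]
losses = [0.20, 0.90, 0.40, 0.70]

weights = reward_weights(feedback, beta=0.5)
loss = reward_weighted_loss(losses, feedback, beta=0.5)
print("weights:", [round(w, 3) for w in weights])
print("weighted loss:", round(loss, 3))
```

In this sketch, accepted videos get larger weights than rejected ones, so the finetuning signal is dominated by samples the feedback model judged realistic. As `beta` grows, the weights approach uniform and the loss reduces to the ordinary average.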