Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.

What problem does this paper attempt to address?

The paper asks how to make dynamic object interactions in text-to-video generation models more realistic and accurate by using external feedback, in particular feedback from large vision-language models. Current models often produce unrealistic motion or violate physical laws when rendering dynamic scenes, so the generated content does not match the prompt and falls short of what practical applications need.

### Main Problems and Solutions

1. **Problem Description**:
   - State-of-the-art text-to-video models handle dynamic object interactions poorly, producing unrealistic motion and violations of physical laws.
   - These problems impede wide use of the models in practical applications such as creative video content generation, animation production, and film production.

2. **Solution**:
   - Introduce an external feedback mechanism to optimize the generated content. With this feedback, the model can adjust its own output to better match the desired result, without extensive manual data collection.
   - The authors derive a unified probabilistic objective for offline reinforcement learning (RL) fine-tuning of text-to-video models. Under this objective, design elements of existing algorithms, such as KL regularization and policy projection, emerge as specific choices within one framework. The derived methods are then used to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), although these metrics often fail to match human perception of generation quality.
   - To address that mismatch, the authors propose using vision-language models to provide more nuanced feedback tailored to object dynamics in videos. Experiments show that binary AI feedback yields the largest improvements in video quality, especially for complex multi-object interactions and realistic depictions of falling objects.

### Specific Methods

1. **Unified Probabilistic Objective Framework**:
   - Derive a unified probabilistic objective for offline RL fine-tuning of text-to-video models. The framework shows how design choices in existing algorithms (such as KL regularization and policy projection) can be understood from a single perspective.

2. **Selection of Feedback Types**:
   - Test metric-based feedback covering semantics, human preferences, and dynamics.
   - Propose binary feedback from large vision-language models as a more accurate measure of generated video quality.

3. **Experimental Verification**:
   - The results show that the proposed framework can effectively optimize a wide variety of rewards, with binary AI feedback performing best at improving video quality for dynamic interactions.
   - Both AI and human evaluations confirm the gains from AI feedback, especially for complex multi-object interactions and realistic depictions of falling objects.

### Conclusion

The paper reports that external feedback, especially binary feedback from large vision-language models, improves the realism and accuracy of dynamic object interactions in text-to-video generation. This points to a direction for future work on the physical consistency and visual realism of generated content.
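To make the idea of fine-tuning on binary feedback more concrete, the sketch below shows one generic way such labels can enter an offline objective: a reward-weighted negative log-likelihood with self-normalized weights proportional to `exp(r / beta)`. This is an illustration under stated assumptions, not the algorithm derived in the paper. The function names (`reward_weights`, `weighted_nll`), the temperature `beta`, and the toy numbers are all hypothetical, and the log-likelihoods are stand-ins for whatever a real video model would report.

```python
import numpy as np


def reward_weights(rewards, beta=1.0):
    """Self-normalized weights w_i proportional to exp(r_i / beta)."""
    r = np.asarray(rewards, dtype=np.float64)
    z = (r - r.max()) / beta  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def weighted_nll(log_probs, rewards, beta=1.0):
    """Reward-weighted negative log-likelihood over a batch of samples."""
    lp = np.asarray(log_probs, dtype=np.float64)
    return float(-(reward_weights(rewards, beta) * lp).sum())


# Hypothetical batch: model log-likelihoods of four generated videos and a
# binary accept (1) / reject (0) label for each, e.g. from a VLM judge.
log_probs = [-2.0, -1.0, -3.0, -0.5]
binary_feedback = [1, 0, 1, 0]

loss = weighted_nll(log_probs, binary_feedback, beta=0.5)
```

With binary rewards, a small `beta` concentrates almost all weight on accepted samples, so the loss approaches plain supervised fine-tuning on the accepted videos only. A large `beta` flattens the weights toward a uniform average over the whole batch.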

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
