VideoAgent: Self-Improving Video Generation

Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang

2024-10-15

Abstract: Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world.

Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the low quality of generated video plans used for robotic control. Existing video generation techniques can create video plans from image observations and language instructions, but the generated videos often contain hallucinated content (such as objects randomly appearing or disappearing) and unrealistic physics (such as a robot's hand passing through objects). As a result, task success is low when control actions are extracted from the generated videos.

To overcome these issues, the paper proposes a framework called **VideoAgent**, which improves the generated video plans by incorporating external feedback. The main contributions include:

1. **Self-Conditioning Consistency**: Using feedback from a pretrained vision-language model (VLM) to iteratively refine the generated videos, reducing hallucinated content and unrealistic physics.
2. **Online Data Collection and Model Fine-Tuning**: Collecting additional data from the environment while executing the generated video plans to further improve the video generation model.
3. **Experimental Validation**: Conducting experiments in simulated robotic manipulation environments (Meta-World and iTHOR), showing that VideoAgent reduces hallucination and improves task success rates, and that it can also refine real-robot videos.

Through these methods, VideoAgent improves both the quality of the generated videos and the success rate of robot control that relies on them.
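The refine-then-execute loop summarized above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the names `refine_plan` and `execute_and_collect` and the three callables passed to them are hypothetical, a "video plan" is reduced to a list of numbers, and the VLM feedback is modeled as a plain accept/reject function.

```python
from typing import Callable, List, Tuple

# Toy stand-in (assumption): a "video plan" is a list of per-frame error values.
Video = List[float]


def refine_plan(
    generate: Callable[[str], Video],
    refine: Callable[[Video, str], Video],
    vlm_accepts: Callable[[Video, str], bool],
    instruction: str,
    max_iters: int = 5,
) -> Tuple[Video, int]:
    """Generate a plan, then refine it until the feedback model accepts it.

    Returns the final plan and the number of refinement steps applied.
    """
    plan = generate(instruction)
    for step in range(max_iters):
        if vlm_accepts(plan, instruction):
            return plan, step
        # Condition the next attempt on the previous plan (hypothetical refiner).
        plan = refine(plan, instruction)
    return plan, max_iters


def execute_and_collect(
    plan: Video,
    instruction: str,
    run_in_env: Callable[[Video], bool],
    buffer: List[Tuple[str, Video, bool]],
) -> bool:
    """Execute the plan and store the outcome for later fine-tuning."""
    success = run_in_env(plan)
    buffer.append((instruction, plan, success))
    return success


# Hypothetical toy components, chosen only so the loop can run end to end.
def toy_generate(instruction: str) -> Video:
    return [4.0]  # initial plan with a large "hallucination" error


def toy_refine(plan: Video, instruction: str) -> Video:
    return [x / 2 for x in plan]  # each refinement halves the error


def toy_vlm_accepts(plan: Video, instruction: str) -> bool:
    return max(abs(x) for x in plan) < 1.0


def toy_env(plan: Video) -> bool:
    return max(abs(x) for x in plan) < 1.0


if __name__ == "__main__":
    buffer: List[Tuple[str, Video, bool]] = []
    plan, steps = refine_plan(toy_generate, toy_refine, toy_vlm_accepts, "open the door")
    ok = execute_and_collect(plan, "open the door", toy_env, buffer)
    print(plan, steps, ok, len(buffer))
```

In this sketch the collected `buffer` plays the role of the additional environment data; how that data is filtered and used to fine-tune the video model is left out, since the summary does not specify it.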

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

This&That: Language-Gesture Controlled Video Generation for Robot Planning

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Video as the New Language for Real-World Decision Making

Learning Universal Policies via Text-Guided Video Generation

Grounding Video Models to Actions through Goal Conditioned Exploration

SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

Task Success is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors

Action-conditioned video data improves predictability

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Video Language Planning

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

RoboDreamer: Learning Compositional World Models for Robot Imagination