WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou

2024-03-11

Abstract: Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, maintaining temporal consistency and ensuring action smoothness throughout the generated sequences remains a formidable challenge. In this paper, we present a video generation AI agent that harnesses Sora-inspired multimodal learning to build a skilled world-model framework based on textual prompts and accompanying images. The framework includes two parts: a prompt enhancer and full video translation. The first part employs ChatGPT to distill and construct precise prompts for each subsequent step, so that prompts are communicated accurately and the following model operations execute accurately. The second part, which is compatible with existing advanced diffusion techniques, generates and refines the key frame at the conclusion of a video. We then use the leading and trailing key frames to craft videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method is effective and novel in constructing world models from text and image inputs compared with other methods.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address several key challenges in video generation, particularly temporal consistency (i.e., ensuring coherence and smoothness of actions between frames in the generated video sequence) and issues of diversity and creativity. Specifically, the paper proposes a new method called WorldGPT, which uses a multimodal learning framework inspired by Sora to construct proficient world models from text prompts and accompanying images. The method has two main components. The first is a prompt enhancer that uses ChatGPT to provide a precise prompt for each subsequent step, so that prompts are communicated accurately. The second is full video translation, which uses existing diffusion techniques to generate and refine the key frame at the end of the video and then uses the leading and trailing key frames to create a video with enhanced temporal consistency and smoothness of actions. According to the paper, experimental results show that this method is more effective and novel than other methods in constructing world models from text and image inputs.
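The two-part flow described above can be sketched as a toy pipeline. This is only an illustrative sketch under stated assumptions, not the authors' implementation: every name here (`enhance_prompt`, `generate_last_keyframe`, `fill_between`, `generate_video`) is hypothetical, the ChatGPT call is replaced by simple string templating, the diffusion step is replaced by a fixed brightening of the first frame, and the in-between frames come from plain linear interpolation rather than a learned video model.

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in for an image: a flat list of pixel intensities in [0, 1].
Frame = List[float]


@dataclass
class StepPrompts:
    """Per-step prompts produced by the (hypothetical) prompt enhancer."""
    keyframe_prompt: str
    motion_prompt: str


def enhance_prompt(user_prompt: str) -> StepPrompts:
    # ASSUMPTION: the paper uses ChatGPT here; this stub only normalizes
    # whitespace and fills two fixed templates.
    cleaned = " ".join(user_prompt.split())
    return StepPrompts(
        keyframe_prompt=f"Final frame of: {cleaned}",
        motion_prompt=f"Smooth motion for: {cleaned}",
    )


def generate_last_keyframe(first: Frame, prompt: str) -> Frame:
    # ASSUMPTION: placeholder for a diffusion model conditioned on `prompt`.
    # The prompt is ignored; each pixel is brightened by 0.5 and clamped.
    return [min(1.0, p + 0.5) for p in first]


def fill_between(first: Frame, last: Frame, num_frames: int) -> List[Frame]:
    # ASSUMPTION: linear blending stands in for the video model that is
    # conditioned on the leading and trailing key frames.
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append([(1 - t) * a + t * b for a, b in zip(first, last)])
    return frames


def generate_video(user_prompt: str, first_frame: Frame,
                   num_frames: int = 5) -> List[Frame]:
    prompts = enhance_prompt(user_prompt)                                 # part 1
    last = generate_last_keyframe(first_frame, prompts.keyframe_prompt)  # part 2a
    return fill_between(first_frame, last, num_frames)                   # part 2b


video = generate_video("a  cat   jumps", [0.0, 0.25], num_frames=5)
```

Anchoring the sequence at both a leading and a trailing key frame is what constrains the intermediate frames in this sketch; in the toy version the first and last frames of `video` are exactly the two key frames.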

Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

RoboDreamer: Learning Compositional World Models for Robot Imagination

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

From Sora What We Can See: A Survey of Text-to-Video Generation

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration