Pandora: Towards General World Model with Natural Language Actions and Video States

Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

2024-06-13

Abstract: World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on the language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper takes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training from scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential for building stronger general world models with larger-scale training.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem of building a general world model: a model that simulates future world states by generating video and lets a user steer those states in real time with natural-language actions. Existing foundation models each cover only part of this. Large language models (LLMs) generate fluent text, but their understanding of the physical world is limited: they rely mainly on patterns in text data and lack a deep grasp of physical and temporal dynamics. Current video generation models produce high-quality video, but they offer no interactive action control during the simulation. The paper therefore proposes Pandora, which aims to combine the strengths of both to achieve cross-domain video generation with real-time natural-language control. Pandora is a hybrid autoregressive-diffusion model that achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Rather than training from scratch, it integrates a pretrained LLM (Vicuna-7B-v1.5) with a pretrained video generation model (DynamiCrafter) and requires only additional lightweight finetuning. The outputs shown in the paper span multiple domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.), which the authors take as an indication that stronger general world models can be built with larger-scale training.
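The interaction pattern described above (video as state, free text as action, generation proceeding clip by clip) can be illustrated with a toy sketch. This is not Pandora's implementation: the class `ToyWorldModel`, the functions `stub_encode`, `stub_generate`, and `rollout`, and the representation of a frame as a short list of numbers are all hypothetical names and simplifications introduced here. The two stubs merely stand in for the roles of the pretrained LLM and the pretrained video model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = List[float]  # toy stand-in for one video frame (assumption)


@dataclass
class ToyWorldModel:
    """Hypothetical interface: the state is a frame history, the action is free text."""
    encode_action: Callable[[str, List[Frame]], List[float]]
    generate_clip: Callable[[List[float], List[Frame]], List[Frame]]
    history: List[Frame] = field(default_factory=list)

    def step(self, action: str) -> List[Frame]:
        # Condition on the text action and on everything generated so far,
        # then append the new clip so the next step continues from it.
        cond = self.encode_action(action, self.history)
        clip = self.generate_clip(cond, self.history)
        self.history.extend(clip)
        return clip


def stub_encode(action: str, history: List[Frame]) -> List[float]:
    # Placeholder for the language-model role: word count plus history length.
    return [float(len(action.split())), float(len(history))]


def stub_generate(cond: List[float], history: List[Frame]) -> List[Frame]:
    # Placeholder for the video-model role: emit two frames tagged with
    # the conditioning value and their position in the overall rollout.
    start = len(history)
    return [[cond[0], float(start + i)] for i in range(2)]


def rollout(model: ToyWorldModel, actions: List[str]) -> List[List[Frame]]:
    # Apply the text actions one after another; each clip extends the history.
    return [model.step(a) for a in actions]


model = ToyWorldModel(stub_encode, stub_generate)
clips = rollout(model, ["open the door", "walk outside"])
```

The point of the sketch is only the control flow: each new action is applied on top of the previously generated video, so a different action at any step leads to a different continuation.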

Understanding World or Predicting Future? A Comprehensive Survey of World Models

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

WorldGPT: Empowering LLM as Multimodal World Model

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Language-Guided World Models: A Model-Based Approach to AI Control

Language Models Meet World Models: Embodied Experiences Enhance Language Models

3D-VLA: A 3D Vision-Language-Action Generative World Model

Making Large Language Models into World Models with Precondition and Effect Knowledge

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning

LanGWM: Language Grounded World Model

Evaluating World Models with LLM for Decision Making

How Far is Video Generation from World Model: A Physical Law Perspective

Generative World Explorer

AVID: Adapting Video Diffusion Models to World Models

Grounded Answers for Multi-agent Decision-making Problem through Generative World Model

WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making