Abstract: This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions. We design a challenging world modeling benchmark based on the game of MESSENGER (Hanjie et al., 2021), featuring evaluation settings that require varying degrees of compositional generalization. Our experiments reveal the lack of generalizability of the state-of-the-art Transformer model, as it offers marginal improvements in simulation quality over a no-text baseline. We devise a more robust model by fusing the Transformer with the EMMA attention mechanism (Hanjie et al., 2021). Our model substantially outperforms the Transformer and approaches the performance of a model with an oracle semantic parsing and grounding capability. To demonstrate the practicality of this model in improving AI safety and transparency, we simulate a scenario in which the model enables an agent to present plans to a human before execution, and to revise plans based on their language feedback.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address how to enhance the capabilities of world models through language-guided methods to improve the controllability and safety of AI agents. Specifically, the paper proposes **Language-Guided World Models (LWMs)**, which are probabilistic models capable of simulating environments by reading text.

### Main Issues

1. **Limitations of Traditional World Models**:
   - Traditional world models can only be adjusted through observational data, which is not suitable for conveying complex and abstract human intentions.
   - Collecting observational data requires actual operations in the environment, which is expensive, time-consuming, and risky.
2. **Challenges of Language Guidance**:
   - Combining natural language descriptions with environmental dynamics is difficult because the descriptions can be rich and complex, covering a wide range of concepts such as entity names, appearances, movements, interactions, and spatial and temporal relationships.
   - In natural language descriptions, especially those of artificial environments (such as games), new concepts are often introduced without being clearly defined.

### Solutions

1. **Proposing LWMs**:
   - LWMs are world models that can be effectively adjusted through natural language communication with humans. These models inherit all the advantages of model-based agents while being able to incorporate language supervision.
   - This capability reduces human teaching effort and lowers the risk of agents taking harmful actions while exploring environmental dynamics.
2. **Building a Benchmark**:
   - The paper constructs a benchmark based on the game MESSENGER to evaluate the compositional generalization ability of LWMs. The benchmark includes three evaluation settings of differing difficulty, each testing a different degree of compositional generalization.
   - Using this benchmark, the authors found that existing Transformer models perform poorly in the more difficult evaluation settings. Even when provided with true decoupled representations of observational data, they fail to learn generalizable grounding functions.
3. **Improving Model Architecture**:
   - The authors combined the Transformer model with the EMMA attention mechanism to design a new model architecture. This architecture performs well in the most challenging evaluation settings, significantly outperforming baseline models and approaching a model with oracle semantic parsing and grounding capabilities.

### Application Prospects

- **Improving AI Safety and Transparency**: LWMs can enable agents to generate execution plans before performing tasks and invite human supervisors to review these plans. Additionally, humans can adjust plans by modifying the agent's world model through language feedback, thereby improving the agent's performance.
- **Reducing the Need for Interaction Experience**: LWMs can utilize pre-existing texts, reducing the need to collect interaction experiences in the environment and saving manpower and time.

In summary, by proposing LWMs, the paper addresses the limitations of traditional world models in language guidance, providing new ideas and methods to improve the controllability and safety of AI agents.
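The plan-review loop described under Application Prospects can be illustrated with a deliberately tiny sketch. This is not the paper's model: the class `ToyLanguageGuidedWorldModel`, the function `propose_plan`, the role-to-reward table, and the example entities and sentences are all hypothetical, and simple keyword matching stands in for the learned Transformer and EMMA grounding. The only point it shows is that a world model conditioned on text lets a human change the agent's plan by editing the description, before anything is executed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical role vocabulary and rewards for this toy example only.
ROLE_REWARD = {"enemy": -1.0, "goal": 1.0, "message": 0.5}


@dataclass
class ToyLanguageGuidedWorldModel:
    """Toy stand-in for an LWM: predicted outcomes depend on a text manual."""
    manual: List[str]

    def role_of(self, entity: str) -> Optional[str]:
        # Keyword matching stands in for learned language grounding.
        for sentence in self.manual:
            words = [w.strip(".,") for w in sentence.lower().split()]
            if entity in words:
                for role in ROLE_REWARD:
                    if role in words:
                        return role
        return None

    def predict_reward(self, entity: str) -> float:
        # Simulated reward for touching `entity`; undescribed entities are neutral.
        return ROLE_REWARD.get(self.role_of(entity), 0.0)


def propose_plan(model: ToyLanguageGuidedWorldModel, entities: List[str]) -> str:
    """Choose the entity with the best simulated outcome; nothing is executed."""
    return max(entities, key=model.predict_reward)


entities = ["mage", "sword", "orb"]
model = ToyLanguageGuidedWorldModel(manual=[
    "The mage is an enemy.",
    "The sword holds the message.",
    "The orb is the goal.",
])
plan = propose_plan(model, entities)     # plan shown to the human before execution

# The human reviews the plan and corrects the description in plain language.
model.manual = [
    "The mage is an enemy.",
    "The sword is the goal.",
    "The orb is an enemy.",
]
revised = propose_plan(model, entities)  # plan after language feedback
print(plan, "->", revised)
```

In this sketch the revision changes only the text given to the model, and the plan follows from re-simulating outcomes under the new description. A learned LWM would replace the keyword lookup with a model that must generalize to descriptions it has not seen during training.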

Language-Guided World Models: A Model-Based Approach to AI Control

Language Models Meet World Models: Embodied Experiences Enhance Language Models

Can Language Models Serve as Text-Based World Simulators?

LanGWM: Language Grounded World Model

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

WorldGPT: Empowering LLM as Multimodal World Model

Evaluating World Models with LLM for Decision Making

Making Large Language Models into World Models with Precondition and Effect Knowledge

Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information

World Models: The Safety Perspective

Grounding Large Language Models In Embodied Environment With Imperfect World Models

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning

Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Theory of Mind for Multi-Agent Collaboration via Large Language Models

Mental Modeling of Reinforcement Learning Agents by Language Models

BWArea Model: Learning World Model, Inverse Dynamics, and Policy for Controllable Language Generation