Abstract: World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world, owing to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, world models are evaluated either as general world simulators or as functional modules of an agent, i.e., predicting transitions to assist planning. In this work, we propose a comprehensive evaluation of world models with LLMs from the decision-making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment to enable diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, in which the world model alone can be used for decision making. Finally, we conduct a comprehensive evaluation of advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge, ii) the performance of the world model with LLMs decreases on long-term decision-making tasks, and iii) combining different functionalities of the world model introduces additional performance instability.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to comprehensively evaluate world models based on large language models (LLMs) from the perspective of decision making. Specifically, the paper focuses on the following points:

1. **Limitations of existing evaluation methods**:
   - Most current research either evaluates world models as general-purpose world simulators or uses them as functional modules of agents that predict state transitions to assist planning.
   - Neither approach fully reflects how world models perform in actual decision making, especially in complex tasks and long-term decision making.
2. **Proposing a new evaluation framework**:
   - To assess the role of world models in decision making more accurately, the paper proposes an evaluation framework built around the decision-making perspective.
   - This framework includes three main tasks: Policy Verification, Action Proposal, and Policy Planning.
3. **Experimental design and result analysis**:
   - The paper uses 31 diverse environments for evaluation. These environments range from daily tasks (such as washing clothes) to scientific tasks (such as forging keys) and have different difficulty levels.
   - Advanced large language models (GPT-4o and GPT-4o-mini) are used in the experiments, and extensive tests are carried out under different settings to evaluate their performance on the three main tasks.

### Main findings

1. **GPT-4o is significantly better than GPT-4o-mini**:
   - On tasks requiring domain knowledge, GPT-4o's advantage is particularly pronounced.
   - As the number of steps requiring verification or planning increases, the performance gap between GPT-4o and GPT-4o-mini gradually widens.
2. **Challenges of long-term decision-making tasks**:
   - On long-term decision-making tasks, the performance of world models declines, which may be due to the cumulative impact of prediction errors.
   - This suggests that large language models may have limitations when handling long-term decision-making tasks.
3. **Instability brought by different functional combinations**:
   - When the Policy Verification and Action Proposal functions are combined, model performance shows greater randomness, which suggests that these modules need to be further decoupled for comprehensive evaluation.

### Conclusion

By proposing a new evaluation framework, the paper aims to evaluate more comprehensively how world models based on large language models perform in decision making. The experimental results reveal the superiority of GPT-4o across multiple tasks and also point out the shortcomings of current methods on long-term decision-making tasks. Future research can explore how to improve these models to handle more complex decision-making scenarios.
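
To make the three tasks named above more concrete, here is a minimal Python sketch of how a world model could be used for policy verification, action proposal, and policy planning. It is an illustration under stated assumptions, not the paper's implementation: `toy_world_model` is a hypothetical deterministic stand-in for an LLM that predicts the next state, states and actions are plain integers, and treating planning as a combination of proposal and verification is inferred only from the summary's remark about combining those two functions.

```python
from typing import Callable, List, Sequence, Tuple

State = int
Action = int
WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for an LLM world model: predicts the next state."""
    return state + action


def policy_verification(model: WorldModel, start: State,
                        actions: Sequence[Action], goal: State) -> bool:
    """Roll the action sequence out in the model; report whether the goal is predicted."""
    state = start
    for action in actions:
        state = model(state, action)
    return state == goal


def action_proposal(model: WorldModel, state: State,
                    candidates: Sequence[Action], goal: State, k: int) -> List[Action]:
    """Return the k candidate actions whose predicted next state is closest to the goal."""
    ranked = sorted(candidates, key=lambda a: abs(goal - model(state, a)))
    return ranked[:k]


def policy_planning(model: WorldModel, start: State, candidates: Sequence[Action],
                    goal: State, max_steps: int) -> Tuple[List[Action], bool]:
    """Greedy plan: take the top proposed action each step, then verify the whole plan."""
    state, plan = start, []
    for _ in range(max_steps):
        if state == goal:
            break
        best = action_proposal(model, state, candidates, goal, k=1)[0]
        plan.append(best)
        state = model(state, best)
    return plan, policy_verification(model, start, plan, goal)


if __name__ == "__main__":
    plan, verified = policy_planning(toy_world_model, start=0,
                                     candidates=[-1, 1, 2], goal=5, max_steps=10)
    print(plan, verified)
```

In this sketch every planning step depends on a model prediction, so any per-step prediction error would carry forward through the rollout, which is one way to read the summary's point about cumulative errors on long horizons.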

Evaluating World Models with LLM for Decision Making

WorldGPT: Empowering LLM as Multimodal World Model

PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Language Models Meet World Models: Embodied Experiences Enhance Language Models

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Language-Guided World Models: A Model-Based Approach to AI Control

Making Large Language Models into World Models with Precondition and Effect Knowledge

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning

Can Language Models Serve as Text-Based World Simulators?

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Understanding World or Predicting Future? A Comprehensive Survey of World Models

Building Decision Making Models Through Language Model Regime

Grounding Large Language Models In Embodied Environment With Imperfect World Models

On the Decision-Making Abilities in Role-Playing using Large Language Models

LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model

Adaptive and transparent decision-making in autonomous robots through graph-structured world models

Introspective Tips: Large Language Model for In-Context Decision Making

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

LLaMA Rider: Spurring Large Language Models to Explore the Open World