Evaluating World Models with LLM for Decision Making

Chang Yang, Xinrun Wang, Junzhe Jiang, Qinggang Zhang, Xiao Huang
2024-11-14
Abstract: The world model has emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world models are either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. In this work, we propose a comprehensive evaluation of the world models with LLMs from the decision-making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate the rule-based policy of each environment for the diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world models can be used for decision making solely. Finally, we conduct a comprehensive evaluation of the advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for the tasks which require domain knowledge, ii) the performance of the world model with LLM decreases for long-term decision-making tasks, and iii) the combination of different functionalities of the world model brings additional instability to the performance.
Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to comprehensively evaluate world models based on large language models (LLMs) from the perspective of decision-making. Specifically, it focuses on the following points:

1. **Limitations of existing evaluation methods**:
   - Most current research either evaluates world models as general-purpose world simulators or uses them as functional modules of agents that predict state transitions to assist in planning.
   - Neither approach fully reflects how world models perform in actual decision-making, especially in complex tasks and long-term decision-making.
2. **Proposing a new evaluation framework**:
   - To evaluate the role of world models in decision-making more accurately, the paper proposes a comprehensive evaluation framework built around the decision-making perspective.
   - This framework includes three main tasks: Policy Verification, Action Proposal, and Policy Planning.
3. **Experimental design and result analysis**:
   - The paper uses 31 diverse environments for evaluation. These environments range from daily tasks (such as washing clothes) to scientific tasks (such as forging keys) and have different difficulty levels.
   - Advanced large language models (GPT-4o and GPT-4o-mini) are used in the experiments, and extensive tests are carried out under different settings to evaluate the performance of these models on the three main tasks.

### Main findings

1. **GPT-4o is significantly better than GPT-4o-mini**:
   - On tasks requiring domain knowledge, the advantage of GPT-4o is particularly prominent.
   - As the number of steps requiring verification or planning increases, the performance gap between GPT-4o and GPT-4o-mini gradually widens.
2. **Challenges of long-term decision-making tasks**:
   - For long-term decision-making tasks, the performance of world models declines, which may be due to the cumulative impact of prediction errors.
   - This indicates that large language models may have limitations when dealing with long-term decision-making tasks.
3. **Instability brought by different functional combinations**:
   - When the Policy Verification and Action Proposal functions are combined, the performance of the model shows greater randomness, which suggests that these modules need to be further decoupled for comprehensive evaluation.

### Conclusion

By proposing a new evaluation framework, the paper aims to evaluate more comprehensively how world models based on large language models perform in decision-making. The experimental results not only reveal the superiority of GPT-4o across multiple tasks but also point out the shortcomings of current methods in long-term decision-making tasks. Future research can further explore how to improve these models to deal with more complex decision-making scenarios.
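
### Illustrative sketch

To make the Policy Verification task and the long-horizon finding more concrete, the following is a minimal Python sketch and not the paper's implementation. The names `toy_world_model`, `verify_policy`, and `error_free_rollout_probability` are hypothetical, as are the laundry rules. In the paper's setting the world model would be an LLM queried for the next state; here a small rule table stands in for it. The compounding formula assumes independent per-step errors, which is a simplifying assumption made for illustration only.

```python
from typing import Callable, FrozenSet, Sequence

State = FrozenSet[str]
WorldModel = Callable[[State, str], State]


def toy_world_model(state: State, action: str) -> State:
    """Hypothetical stand-in for an LLM that predicts the next state."""
    # action -> (precondition fact, fact added when the action succeeds)
    rules = {
        "put clothes in washer": ("clothes dirty", "clothes in washer"),
        "run washer": ("clothes in washer", "clothes clean"),
    }
    if action in rules:
        precondition, effect = rules[action]
        if precondition in state:
            return state | {effect}
    # Unknown action or unmet precondition: the state is left unchanged.
    return state


def verify_policy(
    model: WorldModel, start: State, actions: Sequence[str], goal: str
) -> bool:
    """Roll the action sequence out in the world model and check the goal."""
    state = start
    for action in actions:
        state = model(state, action)
    return goal in state


def error_free_rollout_probability(step_accuracy: float, horizon: int) -> float:
    """Chance that every one of `horizon` predictions is correct,
    assuming independent per-step errors (an illustrative assumption)."""
    return step_accuracy ** horizon


if __name__ == "__main__":
    start = frozenset({"clothes dirty"})
    good = ["put clothes in washer", "run washer"]
    bad = ["run washer", "put clothes in washer"]
    print(verify_policy(toy_world_model, start, good, "clothes clean"))
    print(verify_policy(toy_world_model, start, bad, "clothes clean"))
    print(error_free_rollout_probability(0.9, 10))
```

Under the independence assumption, a 90% per-step accuracy leaves only about a 35% chance of an error-free 10-step rollout, which is one way to read the observed decline on long-term tasks. The actual error behaviour of an LLM world model may differ.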