GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav
2024-07-22
Abstract: Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents' performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores but not comparable to human levels.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses a gap in how the strategic reasoning capabilities of large language models (LLMs) are evaluated. Although LLMs perform well on many natural language understanding tasks, there is no comprehensive framework for assessing their performance in complex strategic scenarios. The paper introduces **GameBench**, a cross-domain benchmark designed to evaluate the strategic reasoning capabilities of LLM agents across various types of games.

### Main Objectives

1. **Filling the Evaluation Gap**: Existing benchmarks mainly focus on practical, in-distribution knowledge, and can easily become saturated as models improve. GameBench aims to avoid this saturation by evaluating the strategic reasoning capabilities of LLMs in multi-player, cross-domain game environments.
2. **Diverse Game Environments**: The benchmark uses 9 different game environments, each covering at least one key reasoning skill found in strategy games. Games are selected so that explanations of their strategy are unlikely to form a significant part of the models' pretraining corpus.
3. **Evaluating Different Models and Enhancement Methods**: GPT-3 and GPT-4 serve as base models and are combined with two enhancement frameworks, Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP), to evaluate performance on strategic reasoning tasks.
4. **Human Baseline Comparison**: The performance of the LLM agents is compared with a random baseline and a human baseline to put the models' scores in context.

### Key Findings

- **Overall Performance**: None of the tested models reached human performance, and in the worst cases GPT-4 performed worse than random action.
- **Effectiveness of Enhancement Methods**: Both CoT and RAP improve the models' scores, but neither brings them to human levels. Notably, the improvement from CoT is more pronounced for GPT-4, making it the best performer in some games.
- **Game Sensitivity**: The models' performance varies greatly across games, with GPT-4 performing particularly poorly in the "Battleship" game.

### Conclusion

By introducing GameBench, this paper provides a comprehensive framework for evaluating the strategic reasoning capabilities of LLM agents. Although enhancement methods like CoT and RAP improve model performance, even the best configurations fall far short of human reasoning levels. This indicates that while LLMs perform well on in-distribution tasks, they still struggle with out-of-distribution tasks. Future research can explore whether more sophisticated enhancement methods improve the strategic reasoning capabilities of LLMs.
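The comparison against a random baseline mentioned above can be illustrated with a small sketch. This is not the paper's scoring method, which this summary does not describe. It is a minimal head-to-head tally under assumed conventions: each match is a hypothetical `(player_a, player_b, winner)` record, a draw is recorded as `None` and counts as half a point, and the agent names and match data are invented for illustration only.

```python
from collections import defaultdict


def score_vs_baseline(matches, baseline="random"):
    """Return each agent's average score in matches against the baseline.

    matches: iterable of (player_a, player_b, winner) tuples, where winner
    is one of the two player names, or None for a draw (assumed format).
    A win counts 1.0, a draw 0.5, a loss 0.0.
    """
    points = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in matches:
        # Only matches with exactly one baseline player are relevant here.
        if baseline not in (a, b) or a == b:
            continue
        agent = b if a == baseline else a
        games[agent] += 1
        if winner is None:
            points[agent] += 0.5
        elif winner == agent:
            points[agent] += 1.0
    return {agent: points[agent] / games[agent] for agent in games}


# Invented match records; "agent_x" and "agent_y" are placeholder names.
matches = [
    ("agent_x", "random", "agent_x"),
    ("random", "agent_x", "agent_x"),
    ("agent_x", "random", "random"),
    ("agent_y", "random", "random"),
    ("random", "agent_y", None),
    ("agent_x", "agent_y", "agent_x"),  # ignored: no baseline player
]

scores = score_vs_baseline(matches)
# An agent scoring under 0.5 here does worse than the random baseline.
below_random = [agent for agent, s in scores.items() if s < 0.5]
print(scores)
print(below_random)
```

Under this convention, a score of 0.5 means parity with random play, so a finding such as "worse than random action" corresponds to a score below 0.5. A real evaluation would need many more matches per pairing and per game before such a comparison is meaningful.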