Abstract: Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, a comprehensive framework for evaluating agents' performance across the various types of reasoning found in games is still lacking. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but not to levels comparable to human performance.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the lack of adequate methods for evaluating the strategic reasoning capabilities of large language models (LLMs). Although LLMs perform well on many natural language understanding tasks, no comprehensive framework exists to assess their performance in complex strategic scenarios. To fill this gap, the paper introduces **GameBench**, a cross-domain benchmark for evaluating the strategic reasoning capabilities of LLM agents across various types of games.

### Main Objectives

1. **Filling the Evaluation Gap**: Existing benchmarks mainly test practical, in-distribution knowledge and can quickly saturate as models improve. GameBench instead evaluates the strategic reasoning capabilities of LLMs in multi-player, cross-domain game environments, which makes it less prone to this saturation.
2. **Diverse Game Environments**: The benchmark comprises 9 different game environments, each covering at least one key reasoning skill found in strategy games. Games were selected so that explanations of their strategy are unlikely to form a significant part of a model's pretraining corpus.
3. **Evaluating Different Models and Enhancement Methods**: GPT-3 and GPT-4 serve as base models, each also combined with two enhancement frameworks, Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP), to measure performance on strategic reasoning tasks.
4. **Human Baseline Comparison**: The performance of LLM agents is compared against both a random baseline and a human baseline to put the models' results in context.

### Key Findings

- **Overall Performance**: None of the tested models reached human performance, and in the worst cases GPT-4 performed worse than random action.
- **Effectiveness of Enhancement Methods**: Both CoT and RAP improve the models' scores but do not bring them to human levels. Notably, CoT yields a larger improvement for GPT-4, making it the best performer in some games.
- **Game Sensitivity**: The models' performance varies greatly across games, with GPT-4 performing particularly poorly in Battleship.

### Conclusion

By introducing GameBench, this paper provides a comprehensive framework for evaluating the strategic reasoning capabilities of LLM agents. Although enhancement methods such as CoT and RAP improve model performance, even the best configurations fall far short of human-level reasoning. This indicates that while LLMs perform well on in-distribution tasks, they still struggle with out-of-distribution tasks. Future research could explore whether more sophisticated enhancement methods can improve the strategic reasoning capabilities of LLMs.

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Strategic Reasoning with Language Models

The Emergence of Strategic Reasoning of Large Language Models

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

GameArena: Evaluating LLM Reasoning through Live Computer Games

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Explore the Reasoning Capability of LLMs in the Chess Testbed

Codenames as a Benchmark for Large Language Models

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

ACPBench: Reasoning about Action, Change, and Planning

From Text to Tactic: Evaluating LLMs Playing the Game of Avalon

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps