GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

2024-06-11

Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment composing 10 widely recognized tasks, across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) Characterize the game-theoretic reasoning of LLMs; and (2) Perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are competitive in probabilistic gaming scenarios; (2) Most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, yet the recently released Llama-3-70b-Instruct makes up for this shortcoming. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto Efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further explorations in the strategic reasoning of LLMs.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper evaluates the strategic reasoning ability of large language models (LLMs) in competitive environments. The authors test logical and strategic reasoning through game-theoretic tasks such as board games and card games. They build a language-driven environment called GTBench, which comprises 10 widely recognized tasks spanning a game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. The main findings are:

1. LLMs perform poorly in complete-information, deterministic games but are competitive in probabilistic games.
2. Most open-source LLMs (such as CodeLlama-34b-Instruct and Llama-2-70b-chat) are less competitive than commercial LLMs (such as GPT-4) in complex games, although the recently released Llama-3-70b-Instruct makes up for this shortcoming.
3. Code pretraining greatly benefits strategic reasoning, while advanced reasoning methods (such as Chain-of-Thought and Tree-of-Thought) do not always improve performance.
4. LLMs behave differently across game-theoretic scenarios, and the paper further characterizes their game-theoretic properties, such as equilibrium and Pareto efficiency in repeated games.

The paper also uses direct LLM-vs.-LLM competition as a reasoning evaluation and provides detailed error profiles to better explain LLM behavior. The stated aim is to provide standardized protocols for evaluating strategic reasoning in LLMs and to spur further exploration in this area.
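To make the terms "equilibrium" and "Pareto efficiency" concrete, here is a minimal sketch that checks both properties for a one-shot Prisoner's Dilemma. The payoff values, the `PAYOFFS` table, and the helper names `is_nash` and `is_pareto_efficient` are illustrative assumptions and are not taken from the paper or from GTBench.

```python
from itertools import product

# Hypothetical Prisoner's Dilemma payoffs: (row player, column player).
# Actions: 0 = cooperate, 1 = defect. Values are illustrative, not from the paper.
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}
ACTIONS = (0, 1)


def is_nash(profile):
    """True if neither player gains by unilaterally changing their action."""
    a, b = profile
    u_a, u_b = PAYOFFS[profile]
    row_ok = all(PAYOFFS[(alt, b)][0] <= u_a for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, alt)][1] <= u_b for alt in ACTIONS)
    return row_ok and col_ok


def is_pareto_efficient(profile):
    """True if no other outcome is at least as good for both and better for one."""
    u = PAYOFFS[profile]
    for other in product(ACTIONS, repeat=2):
        v = PAYOFFS[other]
        weakly_better = all(x >= y for x, y in zip(v, u))
        strictly_better = any(x > y for x, y in zip(v, u))
        if weakly_better and strictly_better:
            return False
    return True


nash_profiles = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
pareto_profiles = [p for p in product(ACTIONS, repeat=2) if is_pareto_efficient(p)]
print("Nash equilibria:", nash_profiles)
print("Pareto-efficient outcomes:", pareto_profiles)
```

Under these assumed payoffs, mutual defection is the only pure-strategy Nash equilibrium, yet it is the one outcome that is not Pareto efficient. That gap is why repeated games are a useful probe of whether players settle on the equilibrium or reach a more efficient outcome.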

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

LLMs for Relational Reasoning: How Far are We?

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Can LLMs Reason in the Wild with Programs?

GameArena: Evaluating LLM Reasoning through Live Computer Games

Strategic Reasoning with Language Models

Lost in the Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games

Explore the Reasoning Capability of LLMs in the Chess Testbed

Competition-Level Problems are Effective LLM Evaluators

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments