Abstract: Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To fully benefit from the potential of LLMs, it is essential to understand how well they function in complex social scenarios. Game theory, which is already used to model real-world interactions, provides a natural framework for assessing these abilities. This work investigates the performance and merits of LLMs in two canonical two-player non-zero-sum games: Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that, when making decisions in these games, each model is affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not fully rely on logical reasoning when making these strategic decisions. As a result, the LLMs' performance drops when the game configuration is misaligned with the biases affecting them. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show an average performance drop of 32%, 25%, 34%, and 29% respectively in Stag Hunt, and 28%, 16%, 34%, and 24% respectively in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the most substantial performance drop, suggesting that newer models are not addressing these issues. Interestingly, we found that chain-of-thought (CoT) prompting, a commonly used method for improving the reasoning capabilities of LLMs, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone is not a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.
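For readers unfamiliar with the two games, the following is a minimal sketch of how they can be written as payoff tables and checked for pure-strategy Nash equilibria. The payoff values, the function names, and the `reorder` helper are illustrative assumptions and are not taken from the paper. The last step shows that listing the same actions in a different order leaves the equilibria unchanged, which is why a purely logical decision-maker should not be sensitive to the position of an option.

```python
from itertools import product

# Illustrative payoffs (assumed, not the paper's): (row player, column player).
STAG_HUNT = {
    ("Stag", "Stag"): (4, 4),
    ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0),
    ("Hare", "Hare"): (2, 2),
}
PRISONERS_DILEMMA = {
    ("Cooperate", "Cooperate"): (3, 3),
    ("Cooperate", "Defect"): (0, 5),
    ("Defect", "Cooperate"): (5, 0),
    ("Defect", "Defect"): (1, 1),
}


def actions(game):
    """Return the action labels in the order they first appear."""
    seen = []
    for row_action, _ in game:
        if row_action not in seen:
            seen.append(row_action)
    return seen


def pure_nash_equilibria(game):
    """Return the action pairs from which neither player gains by deviating alone."""
    acts = actions(game)
    equilibria = set()
    for row, col in product(acts, repeat=2):
        row_payoff, col_payoff = game[(row, col)]
        row_ok = all(game[(alt, col)][0] <= row_payoff for alt in acts)
        col_ok = all(game[(row, alt)][1] <= col_payoff for alt in acts)
        if row_ok and col_ok:
            equilibria.add((row, col))
    return equilibria


def reorder(game, order):
    """Hypothetical helper: present the same game with actions listed in a new order."""
    return {(r, c): game[(r, c)] for r, c in product(order, repeat=2)}


if __name__ == "__main__":
    print(pure_nash_equilibria(STAG_HUNT))
    print(pure_nash_equilibria(PRISONERS_DILEMMA))
    swapped = reorder(STAG_HUNT, ["Hare", "Stag"])
    print(pure_nash_equilibria(swapped) == pure_nash_equilibria(STAG_HUNT))
```

With these assumed payoffs, Stag Hunt has two pure equilibria (both players hunt stag, or both hunt hare), while Prisoner's Dilemma has only mutual defection.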


Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
