Abstract: Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To fully benefit from the potential of LLMs, it is essential to understand their ability to function in complex social scenarios. Game theory, which is already used to understand real-world interactions, provides a good framework for assessing these abilities. This work investigates the performance and merits of LLMs in two canonical two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not fully rely on logical reasoning when making these strategic decisions. As a result, the LLMs' performance drops when the game configuration is misaligned with the biases affecting them. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show an average performance drop of 32%, 25%, 34%, and 29% respectively in Stag Hunt, and 28%, 16%, 34%, and 24% respectively in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the most substantial performance drop, suggesting that newer models are not addressing these issues. Interestingly, we found that chain-of-thought (CoT) prompting, a commonly used method of improving the reasoning capabilities of LLMs, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone cannot serve as a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

Embers of autoregression show how large language models are shaped by the problem they are trained to solve

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Large Language Models and the Reverse Turing Test

Challenges and Contributing Factors in the Utilization of Large Language Models (LLMs)

A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition

Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models

Eight Things to Know about Large Language Models

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Perils and opportunities in using large language models in psychological research

How to Measure the Intelligence of Large Language Models?

Using large language models in psychology

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

The Importance of Understanding Language in Large Language Models

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Can large language models help predict results from a complex behavioural science study?

Misinforming LLMs: vulnerabilities, challenges and opportunities

Large Language Models are biased to overestimate profoundness

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

Large language models (LLMs): survey, technical frameworks, and future challenges