Abstract: Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To fully benefit from the potential of LLMs, it is essential to understand their ability to function in complex social scenarios. Game theory, which is already used to understand real-world interactions, provides a good framework for assessing these abilities. This work investigates the performance and merits of LLMs in two canonical two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not fully rely on logical reasoning when making these strategic decisions. As a result, the LLMs' performance drops when the game configuration is misaligned with the biases affecting them. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show an average performance drop of 32%, 25%, 34%, and 29%, respectively, in Stag Hunt, and 28%, 16%, 34%, and 24%, respectively, in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the most substantial performance drop, suggesting that newer models are not addressing these issues. Interestingly, we found that chain-of-thought (CoT) prompting, a commonly used method of improving the reasoning capabilities of LLMs, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone cannot serve as a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.

Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias

Cognitive Bias in Decision-Making with LLMs

Reducing Selection Bias in Large Language Models

Metacognitive Myopia in Large Language Models

Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

Generative Language Models Exhibit Social Identity Biases

Cognitive bias in large language models: Cautious optimism meets anti-Panglossian meliorism

Intentional Biases in LLM Responses

Mind vs. Mouth: On Measuring Re-judge Inconsistency of Social Bias in Large Language Models

Evaluating Large Language Model Biases in Persona-Steered Generation

From Bytes to Biases: Investigating the Cultural Self-Perception of Large Language Models

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

AI AI Bias: Large Language Models Favor Their Own Generated Content

The Life Cycle of Large Language Models: A Review of Biases in Education

Large Language Models are Biased Reinforcement Learners

Can Instruction Fine-Tuned Language Models Identify Social Bias through Prompting?

Systematic Biases in LLM Simulations of Debates