Abstract: We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded in ASCII, to assess strategic thinking and decision-making. To evaluate the models' ability to generalize beyond their training data, we introduce two additional games. The first, LEGO Connect Language (LCL), tests the models' capacity to understand spatial logic and follow assembly instructions. The second, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don't tell" strategy uses games instead of simply querying the models. Our results show that, despite their proficiency on standard benchmarks, the ability of GPT-3.5 and GPT-4 to play and reason about fully observable games without pre-training is mediocre. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models fail at the assembly tasks presented in the LCL game. These results suggest that while GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is very limited. Importantly, this reveals a blind spot in current LLM benchmarks, which we highlight with our gameplay benchmark suite ChildPlay (https://github.com/child-play-neurips/child-play). Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.

Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making

UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models

From Text to Tactic: Evaluating LLMs Playing the Game of Avalon

Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Introspective Tips: Large Language Model for In-Context Decision Making

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

AvalonBench: Evaluating LLMs Playing the Game of Avalon

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

EVOLvE: Evaluating and Optimizing LLMs For Exploration

DeLLMa: Decision Making Under Uncertainty with Large Language Models