Abstract: We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded via ASCII, to assess strategic thinking and decision-making. To evaluate the models' ability to generalize beyond their training data, we introduce two additional games. The first game, LEGO Connect Language (LCL), tests the models' capacity to understand spatial logic and follow assembly instructions. The second game, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don't tell" strategy uses games instead of simply querying the models. Our results show that, despite their proficiency on standard benchmarks, the abilities of GPT-3.5 and GPT-4 to play and reason about fully observable games without pre-training are mediocre. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models fail at the assembly tasks presented in the LCL game. These results suggest that while GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is very limited. Importantly, this reveals a blind spot in current LLM benchmarks that we highlight with our gameplay benchmark suite ChildPlay (https://github.com/child-play-neurips/child-play). Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

Discovering Language Model Behaviors with Model-Written Evaluations

Can Language Models Serve as Text-Based World Simulators?

Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models

Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis

Evaluating Psychological Safety of Large Language Models

PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset

Output Scouting: Auditing Large Language Models for Catastrophic Responses

Knowledge Graph Guided Evaluation of Abstention Techniques

Escalation Risks from Language Models in Military and Diplomatic Decision-Making

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

Evil Geniuses: Delving into the Safety of LLM-based Agents

Using cognitive psychology to understand GPT-3

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains

Safety Assessment of Chinese Large Language Models

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Multilingual Jailbreak Challenges in Large Language Models

AI Sandbagging: Language Models can Strategically Underperform on Evaluations