Abstract: This study investigates the strategic decision-making abilities of large language models (LLMs) via the game of Tic-Tac-Toe, renowned for its straightforward rules and definitive outcomes. We developed a mobile application coupled with web services, facilitating gameplay among leading LLMs, including Jurassic-2 Ultra by AI21, Claude 2.1 by Anthropic, Gemini-Pro by Google, GPT-3.5-Turbo and GPT-4 by OpenAI, Llama2-70B by Meta, and Mistral Large by Mistral, to assess their rule comprehension and strategic thinking. Using a consistent prompt structure in 10 sessions for each LLM pair, we systematically collected data on wins, draws, and invalid moves across 980 games, employing two distinct prompt types to vary the presentation of the game's status. Our findings reveal significant performance variations among the LLMs. Notably, GPT-4, GPT-3.5-Turbo, and Llama2 secured the most wins with the list prompt, while GPT-4, Gemini-Pro, and Mistral Large excelled using the illustration prompt. GPT-4 emerged as the top performer, achieving victory with the minimum number of moves and the fewest errors for both prompt types. This research introduces a novel methodology for assessing LLM capabilities using a game that can illuminate their strategic thinking abilities. Beyond enhancing our comprehension of LLM performance, this study lays the groundwork for future exploration into their utility in complex decision-making scenarios, offering directions for further inquiry and the exploration of LLM limits within game-based frameworks.
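The 980-game total is consistent with one plausible reading of the setup: seven models, every ordered pairing including self-play (7 × 7 = 49), 10 sessions per pairing, and two prompt types. The pairing scheme is an assumption made here for illustration; the abstract does not spell it out. A quick arithmetic check in Python:

```python
from itertools import product

# The seven models named in the abstract.
models = [
    "Jurassic-2 Ultra", "Claude 2.1", "Gemini-Pro", "GPT-3.5-Turbo",
    "GPT-4", "Llama2-70B", "Mistral Large",
]

# Assumption: every ordered (first player, second player) pairing is played,
# including a model playing against itself.
pairings = list(product(models, repeat=2))

sessions_per_pairing = 10                 # stated in the abstract
prompt_types = ["list", "illustration"]   # stated in the abstract

total_games = len(pairings) * sessions_per_pairing * len(prompt_types)
print(total_games)
```

Under these assumptions the product is 49 × 10 × 2, which matches the reported number of games.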

What problem does this paper attempt to address?

This paper evaluates the strategic decision-making ability of large language models (LLMs) in games, using the classic game of Tic-Tac-Toe to measure how well these models understand rules and think strategically. Its main objectives are:

1. **Evaluating the strategic decision-making ability of LLMs**: Have different LLMs play Tic-Tac-Toe and assess how well they understand and follow the game rules, as well as how strategically they play.
2. **Developing evaluation methods**: Build a mobile application and a web service framework that let LLMs play Tic-Tac-Toe against each other without human intervention, recording wins, losses, draws, and invalid moves.
3. **Comparing the performance of different LLMs**: Systematically collect and analyze game data to compare several mainstream LLMs (such as GPT-4, GPT-3.5-Turbo, and Llama2) under different prompt types and identify which models perform better in which situations.
4. **Exploring the potential of LLMs in complex decision-making scenarios**: Use Tic-Tac-Toe, a simple game with explicit rules, as a basis for future research on LLMs in more complex decision-making environments.

### Research Background

As large language models have developed, researchers have become increasingly concerned with how to evaluate their capabilities effectively. Traditional natural language processing (NLP) benchmarks (such as GLUE and SuperGLUE) focus mainly on language understanding and generation tasks and do not evaluate strategic decision-making. Using games as an evaluation tool therefore offers a better view of how LLMs perform in dynamic, interactive environments.

Tic-Tac-Toe is a classic game with simple rules, yet it can effectively test a model's ability to follow rules and its basic strategic thinking. Its outcomes are also clear and easy to analyze, making it an ideal experimental platform.

### Method Overview

The researchers developed an Android application that lets LLMs play Tic-Tac-Toe through a web API. The application supports selecting different LLMs as players and provides two types of prompts (list prompts and illustration prompts) to evaluate how LLMs perform when the game state is presented in different ways. By recording the result of each match, the researchers can systematically analyze LLM performance and draw conclusions.

### Conclusion

The research shows significant differences in how different LLMs perform at Tic-Tac-Toe. For example, GPT-4 performs strongly under both prompt types, winning more games and having the lowest error rate. This research provides an important reference for future exploration of LLMs in complex decision-making scenarios.
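To make the two prompt types and the invalid-move bookkeeping from the method overview more concrete, here is a minimal Python sketch. The prompt wording, the board encoding, and all function names (`list_prompt`, `illustration_prompt`, `is_valid_move`, `winner`) are hypothetical and chosen for illustration; the study's actual prompts and application code are not reproduced in this summary.

```python
from typing import List, Optional

# Assumed encoding: 9 cells in row-major order; None marks an empty cell.
Board = List[Optional[str]]

# The eight winning lines: three rows, three columns, two diagonals.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]

def list_prompt(board: Board) -> str:
    """Present the state as lists of (row,column) cells taken by each player."""
    def cells(mark: str) -> str:
        taken = [f"({i // 3 + 1},{i % 3 + 1})"
                 for i, cell in enumerate(board) if cell == mark]
        return ", ".join(taken) or "none"
    return f"X moves: {cells('X')}\nO moves: {cells('O')}"

def illustration_prompt(board: Board) -> str:
    """Present the state as a drawn 3x3 grid, with '.' for empty cells."""
    rows = [" | ".join(cell or "." for cell in board[3 * r:3 * r + 3])
            for r in range(3)]
    return "\n---------\n".join(rows)

def is_valid_move(board: Board, index: int) -> bool:
    """A move is valid only if it is on the board and the cell is empty."""
    return 0 <= index < 9 and board[index] is None

def winner(board: Board) -> Optional[str]:
    """Return 'X' or 'O' if a line is completed, otherwise None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

board: Board = ["X", None, "O", None, "X", None, None, None, None]
print(list_prompt(board))
print(illustration_prompt(board))
```

A harness along these lines could send either rendering to a model, parse the returned cell, and count the reply as an invalid move whenever `is_valid_move` rejects it.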

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Can Large Language Models Play Games? A Case Study of A Self-Play Approach

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

From Text to Tactic: Evaluating LLMs Playing the Game of Avalon

Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis

Strategic behavior of large language models and the role of game structure versus contextual framing

SmartPlay: A Benchmark for LLMs as Intelligent Agents

Playing Games With GPT: What Can We Learn About a Large Language Model From Canonical Strategic Games?

Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions