Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

Oguzhan Topsakal, Jackson B. Harper
DOI: https://doi.org/10.3390/electronics13081532
IF: 2.9
2024-04-18
Electronics
Abstract: This study investigates the strategic decision-making abilities of large language models (LLMs) via the game of Tic-Tac-Toe, renowned for its straightforward rules and definitive outcomes. We developed a mobile application coupled with web services, facilitating gameplay among leading LLMs, including Jurassic-2 Ultra by AI21, Claude 2.1 by Anthropic, Gemini-Pro by Google, GPT-3.5-Turbo and GPT-4 by OpenAI, Llama2-70B by Meta, and Mistral Large by Mistral, to assess their rule comprehension and strategic thinking. Using a consistent prompt structure in 10 sessions for each LLM pair, we systematically collected data on wins, draws, and invalid moves across 980 games, employing two distinct prompt types to vary the presentation of the game's status. Our findings reveal significant performance variations among the LLMs. Notably, GPT-4, GPT-3.5-Turbo, and Llama2 secured the most wins with the list prompt, while GPT-4, Gemini-Pro, and Mistral Large excelled using the illustration prompt. GPT-4 emerged as the top performer, achieving victory with the minimum number of moves and the fewest errors for both prompt types. This research introduces a novel methodology for assessing LLM capabilities using a game that can illuminate their strategic thinking abilities. Beyond enhancing our comprehension of LLM performance, this study lays the groundwork for future exploration into their utility in complex decision-making scenarios, offering directions for further inquiry and the exploration of LLM limits within game-based frameworks.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
This paper sets out to evaluate the strategic decision-making ability of large language models (LLMs) in games, using the classic game of Tic-Tac-Toe to measure the models' understanding of rules and their strategic thinking. Specifically, the main objectives of the research are:

1. **Evaluating the strategic decision-making ability of LLMs**: By having different LLMs play Tic-Tac-Toe, assess how well they understand and follow the game rules and how well they think strategically.
2. **Developing evaluation methods**: Create a mobile application and a web service framework that let LLMs play Tic-Tac-Toe games independently, recording data such as wins, losses, draws, and invalid moves during play.
3. **Comparing the performance of different LLMs**: Through systematic collection and analysis of data, compare the performance of multiple mainstream LLMs (such as GPT-4, GPT-3.5-Turbo, and Llama2) under different prompt types to find out which models perform better in specific situations.
4. **Exploring the potential of LLMs in complex decision-making scenarios**: Starting from Tic-Tac-Toe, a simple game with explicit rules, provide a basis for future research on the potential of LLMs in more complex decision-making environments.

### Research Background

With the development of large language models, researchers are increasingly concerned with how to evaluate the capabilities of these models effectively. Traditional natural language processing (NLP) benchmarks (such as GLUE and SuperGLUE) focus mainly on language understanding and generation tasks and do not evaluate strategic decision-making ability. Using games as an evaluation tool can therefore give a better picture of how LLMs perform in dynamic, interactive environments.
As a classic game, Tic-Tac-Toe has simple rules, yet it can effectively test a model's rule-following ability and basic strategic thinking. In addition, its outcomes are clear and easy to analyze, which makes it a well-suited experimental platform.

### Method Overview

The researchers developed an Android application that lets LLMs play Tic-Tac-Toe through a web API. The application supports selecting different LLMs as players and provides two types of prompts (list prompts and illustration prompts) to evaluate how LLMs perform when the game state is presented in different ways. By recording the result of each match, the researchers can systematically analyze the performance of the LLMs and draw conclusions.

### Conclusion

The research shows significant differences in how different LLMs perform at Tic-Tac-Toe. For example, GPT-4 was the top performer under both prompt types, winning in the fewest moves and making the fewest errors. This research provides a reference point for future exploration of LLMs in complex decision-making scenarios.
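To make the method overview more concrete, the following is a minimal Python sketch of the two ways a board state might be presented to a model, together with the checks needed to record wins, draws, and invalid moves. The summary does not reproduce the paper's actual prompt wording or data format, so the board encoding, the function names (`list_prompt`, `illustration_prompt`, `is_valid_move`, `winner`, `is_draw`), and the text layouts are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Optional

# Assumed encoding: 9 cells in row-major order, each "X", "O", or " " (empty).
Board = List[str]

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def list_prompt(board: Board) -> str:
    """Hypothetical 'list' presentation: occupied cells as 1-based (row,col) pairs."""
    def cells(mark: str) -> str:
        found = [f"({i // 3 + 1},{i % 3 + 1})" for i, c in enumerate(board) if c == mark]
        return ", ".join(found) or "none"
    return f"X moves: {cells('X')}\nO moves: {cells('O')}"


def illustration_prompt(board: Board) -> str:
    """Hypothetical 'illustration' presentation: a text drawing of the grid."""
    rows = []
    for r in range(3):
        row = board[r * 3 : r * 3 + 3]
        rows.append(" | ".join(c if c != " " else "." for c in row))
    return "\n".join(rows)


def is_valid_move(board: Board, cell: int) -> bool:
    """A move is valid only if the cell index is on the board and the cell is empty."""
    return 0 <= cell < 9 and board[cell] == " "


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player owns a full line, otherwise None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def is_draw(board: Board) -> bool:
    """A draw is a full board with no winner."""
    return winner(board) is None and " " not in board


if __name__ == "__main__":
    board = ["X", "O", "X",
             " ", "X", " ",
             "O", " ", "O"]
    print(list_prompt(board))
    print(illustration_prompt(board))
    print("cell 4 valid?", is_valid_move(board, 4))  # occupied, so an invalid move
    print("cell 7 valid?", is_valid_move(board, 7))
```

In a harness like the one described, a model's reply would be parsed into a cell index, checked with `is_valid_move`, and counted as an invalid move if the check fails; `winner` and `is_draw` would then decide when a game ends and how it is tallied.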