Abstract: Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods, which require training for each new environment and a specified reward function, these LLMs use pre-existing multimodal knowledge to engage directly with game environments. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Our results show that these multimodal LLMs are not yet capable of being zero-shot low-level policies. Furthermore, we see that this is due, in part, to limitations in their visual and spatial reasoning. Additional results and videos are available on our project webpage: https://dev1nw.github.io/atari-gpt/

What problem does this paper attempt to address?

The paper evaluates how well multimodal large language models (LLMs) perform as low-level controllers in Atari games, particularly in the zero-shot setting. Specifically, it explores the following aspects:

1. **Low-level control capabilities of multimodal LLMs**:
   - Traditional reinforcement learning (RL) methods must be trained specifically for each new environment and reward function, whereas multimodal LLMs use pre-existing multimodal knowledge to interact directly with the game environment.
   - The paper introduces a new benchmark that evaluates the low-level policy execution of multiple multimodal LLMs in Atari games, in order to verify whether they can be effective low-level controllers.
2. **Visual understanding and spatial reasoning**:
   - The researchers pay special attention to how well these models understand complex visual scenes and reason spatially about them.
   - The experimental results show that although multimodal LLMs perform well in some tasks, they have significant difficulties with spatial reasoning, which may be one of the reasons for their poor performance in the game environment.
3. **Comparison with existing methods**:
   - The paper compares multimodal LLMs with traditional RL agents, random agents, and human players to evaluate their performance comprehensively.
   - The results show that although multimodal LLMs fail to reach the level of humans or RL agents, they still outperform random agents and demonstrate a certain degree of understanding and decision-making capability.
4. **Feasibility of real-time decision-making**:
   - The research also explores the potential of these models for real-time decision-making and finds that current multimodal LLMs are still too slow at inference to meet the requirements of real-time low-level control.
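The comparison against random and human baselines described above can be sketched as follows. This is a minimal illustration, not the paper's code: `ToyEnv`, `run_episode`, and `human_normalized` are hypothetical names, the toy environment merely stands in for an Atari game, and the human-normalized score is a common convention in Atari benchmarks that the paper may or may not use in exactly this form.

```python
import random
from typing import Callable, Tuple


class ToyEnv:
    """Hypothetical stand-in for an Atari game.

    The observation is an integer "frame" that reveals which action
    earns a reward on the next step.
    """

    n_actions = 4

    def __init__(self, horizon: int = 100, seed: int = 0) -> None:
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.target = 0

    def reset(self) -> int:
        self.t = 0
        self.target = self.rng.randrange(self.n_actions)
        return self.target

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Reward 1.0 only if the action matches the currently shown target.
        reward = 1.0 if action == self.target else 0.0
        self.t += 1
        self.target = self.rng.randrange(self.n_actions)
        return self.target, reward, self.t >= self.horizon


def run_episode(env: ToyEnv, policy: Callable[[int], int]) -> float:
    """Roll out one episode; the policy maps an observation to an action."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def human_normalized(score: float, random_score: float, human_score: float) -> float:
    """0.0 corresponds to the random agent, 1.0 to the human baseline."""
    return (score - random_score) / (human_score - random_score)


# A policy that reads the frame correctly collects every reward.
perfect_score = run_episode(ToyEnv(), lambda obs: obs)  # 100.0

# A random agent ignores the frame entirely.
rng = random.Random(1)
random_score = run_episode(ToyEnv(), lambda obs: rng.randrange(ToyEnv.n_actions))
```

In a real evaluation, the policy callable would wrap a multimodal LLM call that receives the game frame and returns an action index; that wrapper is deliberately left out here because its prompt format and API are not specified in this summary.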
In summary, this paper aims to explore the potential and limitations of multimodal LLMs in low-level control tasks, especially in the Atari game environment. Through systematic experiments and analysis, the researchers reveal the challenges of these models in visual understanding, spatial reasoning, and real-time decision-making, and provide valuable references for future research.
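To make the real-time constraint concrete, the rough arithmetic below compares a per-decision time budget with model latency. The numbers are assumptions for illustration only: 60 frames per second is the usual Atari emulation rate, a frame skip of 4 is a common RL setting, and the 2-second latency is a hypothetical figure, not a measurement from the paper.

```python
import math


def decision_budget_ms(fps: float = 60.0, frame_skip: int = 4) -> float:
    """Time available per decision if the agent acts once every `frame_skip` frames."""
    return 1000.0 * frame_skip / fps


def frames_elapsed(latency_s: float, fps: float = 60.0) -> int:
    """Number of game frames that pass while the model is still computing one action."""
    return math.ceil(latency_s * fps)


budget = decision_budget_ms()   # roughly 66.7 ms per decision
missed = frames_elapsed(2.0)    # 120 frames for a hypothetical 2 s response
```

Under these assumptions, a model that takes seconds per response falls behind the game by two orders of magnitude, so an evaluation has to either pause the emulator between actions or accept stale observations.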

Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games
