Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games

Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks
2024-12-02
Abstract: Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods, which require training for each new environment and a specified reward function, these LLMs use pre-existing multimodal knowledge to engage directly with game environments. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Our results show that these multimodal LLMs are not yet capable of being zero-shot low-level policies. Furthermore, we see that this is due, in part, to limitations in their visual and spatial reasoning. Additional results and videos are available on our project webpage: https://dev1nw.github.io/atari-gpt/
Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how well multimodal large language models (LLMs) perform as low-level controllers in Atari games, particularly in the zero-shot setting. Specifically, it explores the following aspects:

1. **Low-level control capabilities of multimodal LLMs**:
   - Traditional reinforcement learning (RL) methods must be trained separately for each new environment and reward function, whereas multimodal LLMs use pre-existing multimodal knowledge to interact directly with the game environment.
   - The paper introduces a new benchmark that evaluates how well multiple multimodal LLMs execute low-level policies in Atari games, in order to test whether they can serve as effective low-level controllers.
2. **Visual understanding and spatial reasoning**:
   - The researchers pay special attention to how well these models understand and reason spatially about complex visual scenes.
   - The experimental results show that although multimodal LLMs perform well in some tasks, they have significant difficulty with spatial reasoning, which may be one of the reasons for their poor performance in the game environment.
3. **Comparison with existing methods**:
   - The paper compares multimodal LLMs with traditional RL agents, random agents, and human players.
   - The results show that although multimodal LLMs fail to reach the level of humans or RL agents, they still outperform random agents and demonstrate some degree of understanding and decision-making capability.
4. **Feasibility of real-time decision-making**:
   - The research also explores the potential of these models for real-time decision-making and finds that current multimodal LLMs are still too slow at inference to meet the requirements of real-time low-level control.
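The setup described above, a model that looks at a frame and directly emits a game action, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `ToyEnv` is a tiny stand-in for a game, `stub_model` is a placeholder for a call to a multimodal LLM, the four-entry `ACTIONS` list is hypothetical, and `action_repeat` is just one possible way to call a slow model less often.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical action vocabulary; real Atari games expose game-specific action sets.
ACTIONS: List[str] = ["NOOP", "FIRE", "LEFT", "RIGHT"]
Frame = List[List[int]]  # stand-in for an image observation


class ToyEnv:
    """Toy stand-in for a game: reward 1 when the action matches the shown target side."""

    def __init__(self, horizon: int = 20, seed: int = 0) -> None:
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.target = "LEFT"

    def _frame(self) -> Frame:
        # Encode the target side as a 1x2 "image".
        return [[1, 0]] if self.target == "LEFT" else [[0, 1]]

    def reset(self) -> Frame:
        self.t = 0
        self.target = self.rng.choice(["LEFT", "RIGHT"])
        return self._frame()

    def step(self, action: int) -> Tuple[Frame, float, bool]:
        reward = 1.0 if ACTIONS[action] == self.target else 0.0
        self.t += 1
        self.target = self.rng.choice(["LEFT", "RIGHT"])
        return self._frame(), reward, self.t >= self.horizon


def stub_model(frame: Frame) -> str:
    """Placeholder for a multimodal LLM call; returns free-form text naming an action."""
    return "I will move LEFT" if frame[0][0] == 1 else "Action: RIGHT"


def parse_action(reply: str) -> int:
    """Map free-form model text to an action index; fall back to NOOP if none is found."""
    upper = reply.upper()
    for idx, name in enumerate(ACTIONS):
        if name in upper:
            return idx
    return 0


def rollout(env: ToyEnv, policy: Callable[[Frame], int], action_repeat: int = 1) -> float:
    """Run one episode, querying the policy once per `action_repeat` environment steps."""
    frame, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(frame)
        for _ in range(action_repeat):
            frame, reward, done = env.step(action)
            total += reward
            if done:
                break
    return total


def llm_policy(frame: Frame) -> int:
    return parse_action(stub_model(frame))


def make_random_policy(seed: int = 0) -> Callable[[Frame], int]:
    rng = random.Random(seed)
    return lambda frame: rng.randrange(len(ACTIONS))


llm_score = rollout(ToyEnv(), llm_policy)
random_score = rollout(ToyEnv(), make_random_policy())
print(f"stub model: {llm_score}, random agent: {random_score}")
```

Swapping `stub_model` for a real model call and `ToyEnv` for an actual Atari environment would keep the same loop shape; the NOOP fallback in `parse_action` is one simple way to handle replies that name no valid action.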
In summary, this paper explores the potential and limitations of multimodal LLMs in low-level control tasks, specifically in the Atari game environment. Through systematic experiments and analysis, the researchers reveal the challenges these models face in visual understanding, spatial reasoning, and real-time decision-making, and provide a reference point for future research.