Abstract: Recently, large language model (LLM)-based agents have made significant advances across various fields. One of the most popular research areas involves applying these agents to video games. Traditionally, these methods have relied on game APIs to access in-game environmental and action data. However, this approach is limited by the availability of APIs and does not reflect how humans play games. With the advent of vision language models (VLMs), agents now have enhanced visual understanding capabilities, enabling them to interact with games using only visual inputs. Despite these advances, current approaches still face challenges in action-oriented tasks, particularly in action role-playing games (ARPGs), where reinforcement learning methods are prevalent but suffer from poor generalization and require extensive training. To address these limitations, we select an ARPG, "Black Myth: Wukong", as a research platform to explore the capability boundaries of existing VLMs in scenarios requiring visual-only input and complex action output. We define 12 tasks within the game, with 75% focusing on combat, and incorporate several state-of-the-art VLMs into this benchmark. Additionally, we will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions. Moreover, we propose a novel VARP (Vision Action Role-Playing) agent framework, consisting of an action planning system and a visual trajectory system. Our framework demonstrates the ability to perform basic tasks and succeed in 90% of easy and medium-level combat scenarios. This research aims to provide new insights and directions for applying multimodal agents in complex action game environments. The code and datasets will be made available at https://varp-agent.github.io/.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper primarily explores the application boundaries of Vision Language Models (VLMs) in Action Role-Playing Games (ARPGs) and proposes a new framework named VARP (Vision Action Role-Playing). Specifically:

1. **Challenges of Visual Input**:
   - Existing methods mostly rely on game APIs to obtain environment and action data, but this approach is limited by the availability of APIs and does not align with how humans play games.
   - VLMs can understand the game environment solely through visual input, but current methods still face challenges in tasks requiring complex action outputs.
2. **Limitations of Reinforcement Learning**:
   - In ARPGs, Reinforcement Learning (RL) methods, although common, generalize poorly and require extensive training time.
   - RL agents typically can only complete specific tasks in the environments they were trained on and perform poorly on other tasks.
3. **Choice of Experimental Platform**:
   - The game "Black Myth: Wukong" was chosen as the research platform, with 12 tasks defined, 75% of which focus on combat scenarios.
   - Several state-of-the-art VLMs (such as GPT-4o) were included in this benchmark to comprehensively explore their capability boundaries.
4. **Proposed New Framework**:
   - The VARP framework was proposed, which includes an action planning system and a human-guided trajectory system.
   - Experimental validation showed that this framework could achieve a 90% success rate in basic and intermediate combat scenarios.

In summary, this paper aims to overcome the limitations of existing methods in visual input and complex action output by introducing the new VARP framework, thereby improving the performance and generalization capabilities of agents in ARPGs.
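The vision-only setting described above (screen frames in, keyboard and mouse actions out, no game API) can be sketched as a simple observe-plan-act loop. This is a minimal illustration under stated assumptions, not the paper's VARP implementation: `Action`, `plan_action`, and `run_episode` are hypothetical names, the frames are plain-text descriptions rather than screenshots, and the planner is a hard-coded stand-in for a real VLM call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Action:
    """A keyboard or mouse event the agent wants to emit (hypothetical type)."""
    device: str     # "keyboard" or "mouse"
    key: str        # e.g. "w", "space", "left_click"
    seconds: float  # how long to hold the input


def plan_action(frame_description: str) -> Action:
    """Stand-in for a VLM call: maps what is visible to a single action.

    A real system would send a screenshot to a VLM. Here the "frame" is
    already a text description, so the example runs without any model.
    """
    if "enemy attacking" in frame_description:
        return Action("keyboard", "space", 0.1)    # dodge (assumed key binding)
    if "enemy" in frame_description:
        return Action("mouse", "left_click", 0.1)  # light attack (assumed binding)
    return Action("keyboard", "w", 0.5)            # walk forward


def run_episode(frames: Iterable[str],
                planner: Callable[[str], Action]) -> List[Action]:
    """Observe each frame, plan from the visual input only, record the action."""
    trajectory: List[Action] = []
    for frame in frames:
        trajectory.append(planner(frame))
    return trajectory


if __name__ == "__main__":
    demo_frames = ["empty path", "enemy ahead", "enemy attacking"]
    for action in run_episode(demo_frames, plan_action):
        print(action)
```

Passing the planner in as a callable keeps the loop independent of any particular model, so a real VLM-backed planner could replace the stub without changing `run_episode`.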

Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case

Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information

GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

Game On: Towards Language Models as RL Experimenters

Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality

A Survey on Vision-Language-Action Models for Embodied AI

A Survey on Large Language Model-Based Game Agents

Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Crafting Dynamic Virtual Activities with Advanced Multimodal Models

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Tachikuma: Understading Complex Interactions with Multi-Character and Novel Objects by Large Language Models

3D-VLA: A 3D Vision-Language-Action Generative World Model

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

A3VLM: Actionable Articulation-Aware Vision Language Model

Behavioral Analysis of Vision-and-Language Navigation Agents

VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents