GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

Aoran Mei, Jianhua Wang, Guo-Niu Zhu, Zhongxue Gan
2024-05-22
Abstract: With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
This paper proposes GameVLM, a framework for decision-making in robotic task planning. Traditional task planning strategies adapt poorly to unknown factors, while pre-trained visual-language models (VLMs) such as GPT-4V are strong at multimodal information parsing and code generation but suffer from hallucination, semantic complexity, and limited context. To address these issues, the paper introduces a multi-agent framework based on zero-sum game theory. GameVLM consists of decision agents and an expert agent. The decision agents plan the task and generate code, while the expert agent evaluates these plans for consistency. When the agents disagree, a zero-sum game resolves the conflict: through a question-and-answer mechanism, each agent challenges the strategies of the other agents, and the expert agent scores the exchange. An inconsistent answer costs the responder points and earns the questioner the same number of points, and the strategy with the highest final score is selected as the optimal solution. Experiments on real robots show an average success rate of 83.3%, demonstrating the effectiveness of the framework. The paper evaluates GameVLM on six tasks with different characteristics: it performs well at understanding scenes, executing complex tasks, and imitating behaviors, but leaves room for improvement in predicting future actions. In conclusion, by combining visual-language models with zero-sum game theory, GameVLM improves the accuracy and efficiency of decisions in robotic task planning, particularly in tasks that involve understanding and processing visual and spatial information. Future research will focus on long-term task planning.
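The question-and-answer scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `is_consistent` callback standing in for the expert agent's verdict, the one-point penalty, and the example verdicts are all assumptions made for illustration. The sketch only shows the zero-sum property: every point a responder loses goes to the questioner, so the scores always sum to zero.

```python
from itertools import permutations
from typing import Callable, Dict, List


def zero_sum_scores(
    agents: List[str],
    is_consistent: Callable[[str, str], bool],
    penalty: int = 1,
) -> Dict[str, int]:
    """Score decision agents through pairwise challenges.

    is_consistent(questioner, responder) is a hypothetical stand-in for the
    expert agent's verdict on the responder's answer to the questioner.
    """
    scores = {name: 0 for name in agents}
    # Every agent questions every other agent once.
    for questioner, responder in permutations(agents, 2):
        if not is_consistent(questioner, responder):
            # Zero-sum transfer: the responder's loss is the questioner's gain.
            scores[responder] -= penalty
            scores[questioner] += penalty
    return scores


def best_agent(scores: Dict[str, int]) -> str:
    """Return the agent whose strategy ended with the highest score."""
    return max(scores, key=scores.get)


# Illustrative verdicts only: (questioner, responder) pairs whose answers
# the expert is assumed to judge inconsistent.
inconsistent = {("A", "B"), ("C", "B"), ("A", "C")}
scores = zero_sum_scores(
    ["A", "B", "C"],
    lambda q, r: (q, r) not in inconsistent,
)
print(scores)              # {'A': 2, 'B': -2, 'C': 0}
print(best_agent(scores))  # A
```

In a real system the callback would wrap a call to the expert VLM. The source does not specify how ties are handled, so this sketch does not handle them specially: `max` simply returns the first of the top-scoring agents.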