GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

Aoran Mei, Jianhua Wang, Guo-Niu Zhu, Zhongxue Gan

2024-05-22

Abstract: With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.

Robotics, Artificial Intelligence

What problem does this paper attempt to address?

This paper proposes GameVLM, a framework for decision-making in robotic task planning. Traditional task planning strategies adapt poorly to unknown factors. Pre-trained visual language models (VLMs) such as GPT-4V perform well at multimodal information parsing and code generation, but they suffer from hallucination, semantic complexity, and limited context. To address these issues, the paper introduces a multi-agent framework based on zero-sum game theory.

GameVLM consists of decision agents and an expert agent. The decision agents plan the task and generate code, while the expert agent evaluates the resulting plans for consistency. When the agents disagree, a zero-sum game resolves the conflict: through a question-and-answer mechanism, each agent challenges the strategies of the other agents, and the expert agent scores the exchange. An inconsistent answer costs the responder points and earns the questioner points, and the strategy with the highest final score is selected as the optimal solution.

Experiments on real robots show an average success rate of 83.3%. The paper evaluates GameVLM on six tasks with different characteristics and reports strong performance in understanding scenes, executing complex tasks, and imitating behaviors, with room for improvement in predicting future actions. Overall, by combining visual language models with zero-sum game theory, GameVLM improves the accuracy and efficiency of decisions in robotic task planning, particularly for tasks that involve understanding and processing visual and spatial information. Future research will focus on long-term task planning.
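The scoring rule in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the `Challenge` record, the function names, the agent names, and the fixed penalty of one point per inconsistent answer are all assumptions made for illustration, and the expert agent's verdict is stubbed as a boolean rather than produced by a VLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    """One question-and-answer exchange (hypothetical structure)."""
    questioner: str   # agent that asks the question
    responder: str    # agent whose plan is being challenged
    consistent: bool  # stand-in for the expert agent's verdict on the answer


def score_plans(agents, challenges, penalty=1.0):
    """Apply the zero-sum rule: an inconsistent answer moves `penalty`
    points from the responder to the questioner (penalty value is assumed)."""
    scores = {name: 0.0 for name in agents}
    for c in challenges:
        if not c.consistent:
            scores[c.responder] -= penalty
            scores[c.questioner] += penalty
    return scores


def select_plan(scores):
    """Return the agent whose plan has the highest final score."""
    return max(scores, key=scores.get)


# Hypothetical round with three decision agents.
agents = ["agent_a", "agent_b", "agent_c"]
challenges = [
    Challenge("agent_a", "agent_b", consistent=False),
    Challenge("agent_b", "agent_a", consistent=True),
    Challenge("agent_c", "agent_b", consistent=False),
    Challenge("agent_a", "agent_c", consistent=True),
    Challenge("agent_c", "agent_a", consistent=False),
]

scores = score_plans(agents, challenges)
best = select_plan(scores)
print(scores, best)
```

Because every deduction is matched by an equal gain, the scores always sum to zero, which is what makes the game zero-sum. How ties are broken and how many points each exchange is worth are not stated in the summary, so this sketch simply takes the first maximum.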

