GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

2024-09-27

Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
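
The temporal grounding step described in the abstract (pinpointing moments of grasping and releasing) can be illustrated with a minimal sketch. It assumes a per-frame boolean signal saying whether the hand is in contact with the object; that signal, the function name `find_grasp_release`, and the convention for demonstrations that end mid-grasp are all hypothetical and are not taken from the paper, whose actual hand-object analysis is not detailed here.

```python
from typing import List, Tuple


def find_grasp_release(contact: List[bool]) -> List[Tuple[int, int]]:
    """Return (grasp_frame, release_frame) pairs from a per-frame contact signal.

    Assumption: contact[i] is True when the hand touches the object in frame i.
    The grasp frame is the first frame of a contact run; the release frame is
    the first frame after that run ends. If the video ends while the hand is
    still in contact, the release frame is set to len(contact).
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current contact run began, if any
    for i, touching in enumerate(contact):
        if touching and start is None:
            start = i  # transition no-contact -> contact: grasp
        elif not touching and start is not None:
            segments.append((start, i))  # transition contact -> no-contact: release
            start = None
    if start is not None:
        segments.append((start, len(contact)))  # contact run reaches end of video
    return segments


# Hypothetical contact signal for a ten-frame clip with two grasps.
contact = [False, False, True, True, True, False, False, True, True, False]
print(find_grasp_release(contact))  # [(2, 5), (7, 9)]
```

Frames inside each returned interval are where affordance information such as grasp type or waypoints would be collected in a pipeline of this kind.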

Robotics, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to use off-the-shelf large vision-language models (such as GPT-4V) and large language models (such as GPT-4) to generate robot-executable task plans from a single human demonstration video. Specifically, it proposes a multimodal task planner that analyzes videos of humans performing tasks and extracts both a symbolic task plan and the environmental information needed to manipulate objects, i.e., affordances: the action possibilities that objects in the environment offer. The system aims to reduce dependence on large task-specific datasets, improve robots' adaptability across scenarios, and ground task plans in the actual environment so that robots can execute them. The main contributions of the paper are: (1) a plug-and-play multimodal task planner built on off-the-shelf vision-language and large language models; (2) a method for aligning the recognition results of GPT-4V with the environmental information required for robot operation; and (3) publicly released code as a practical resource for the robotics research community. With this approach, the authors hope that robots can perform complex tasks from one-shot human demonstration videos by modifying prompts alone, without additional training. This improves the system's flexibility and reusability and reduces development cost and time.
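
The plug-and-play structure summarized above can be sketched as three interchangeable stages: a video describer, a symbolic planner, and a grounding step that attaches affordances. Everything in this sketch is an illustrative assumption: the `TaskStep` type, the `plan_from_demo` function, and the stub stages are hypothetical stand-ins, and no real GPT-4V or GPT-4 call is made.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskStep:
    """One step of a symbolic task plan (hypothetical schema)."""
    action: str                      # e.g. "grasp", "move", "release"
    target: str                      # name of the manipulated object
    affordances: Dict[str, str] = field(default_factory=dict)


def plan_from_demo(
    video: str,
    describe: Callable[[str], str],
    plan: Callable[[str], List[TaskStep]],
    ground: Callable[[str, TaskStep], TaskStep],
) -> List[TaskStep]:
    """Chain the three stages; each stage can be swapped independently."""
    description = describe(video)    # stand-in for the vision-language model
    steps = plan(description)        # stand-in for the language-model planner
    return [ground(video, step) for step in steps]  # attach affordances


# Stub stages so the sketch runs without any external model or detector.
def stub_describe(video: str) -> str:
    return "grasp cup; release cup"


def stub_plan(description: str) -> List[TaskStep]:
    pairs = (part.strip().split(" ", 1) for part in description.split(";"))
    return [TaskStep(action=a, target=t) for a, t in pairs]


def stub_ground(video: str, step: TaskStep) -> TaskStep:
    step.affordances["source_video"] = video
    return step


steps = plan_from_demo("demo.mp4", stub_describe, stub_plan, stub_ground)
for s in steps:
    print(s.action, s.target, s.affordances)
```

Passing the stages in as callables mirrors the idea that behavior is changed by swapping models or prompts rather than by retraining.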
