GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
2024-09-27
Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
Robotics, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to use off-the-shelf vision-language models (such as GPT-4V) and large language models (such as GPT-4) to generate robot-executable task plans from one-shot human demonstration videos. Specifically, it proposes a multimodal task planner that analyzes videos of humans performing tasks and extracts from them both a symbolic task plan and the affordance information needed to manipulate objects ("affordance" here refers to the action possibilities that objects in the environment offer). The system aims to reduce dependence on large task-specific datasets, to improve the robot's adaptability across scenarios, and to ground the task plan in the actual environment so that the robot can execute it.

The main contributions of the paper are:
1. A plug-and-play multimodal task planner built on off-the-shelf vision-language and large language models.
2. A method for aligning GPT-4V's recognition results with the environmental information required for robot operation.
3. Publicly released code, provided as a practical resource for the robotics research community.

With this approach, the researchers hope that robots can perform complex tasks from one-shot human demonstration videos without additional training, with only the prompts modified. This improves the flexibility and reusability of the system and reduces development cost and time.
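
As a rough illustration of how a symbolic task plan might be temporally grounded in a video, the sketch below attaches frame indices to grasp and release steps. It is not the authors' implementation: the `TaskStep` schema, the function names, and the per-frame `contact` flags (assumed here to come from some hand-object interaction detector) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskStep:
    """One symbolic step of a task plan (hypothetical schema, not from the paper)."""
    action: str                      # e.g. "grasp", "move", "release"
    target: str                      # object name taken from the video description
    frame: Optional[int] = None      # filled in by temporal grounding
    affordances: dict = field(default_factory=dict)  # e.g. grasp type, waypoints


def find_contact_transitions(contact: list[bool]) -> list[tuple[str, int]]:
    """Return ("grasp", i) where hand-object contact starts and ("release", i) where it ends."""
    events = []
    previous = False
    for i, current in enumerate(contact):
        if current and not previous:
            events.append(("grasp", i))
        elif previous and not current:
            events.append(("release", i))
        previous = current
    return events


def ground_plan(plan: list[TaskStep], contact: list[bool]) -> list[TaskStep]:
    """Attach frame indices to grasp/release steps, matching events in temporal order."""
    events = find_contact_transitions(contact)
    cursor = 0
    for step in plan:
        for j in range(cursor, len(events)):
            if events[j][0] == step.action:
                step.frame = events[j][1]
                cursor = j + 1
                break
    return plan


# Assumed inputs: a three-step plan and one contact flag per video frame.
plan = [
    TaskStep("grasp", "cup"),
    TaskStep("move", "cup"),
    TaskStep("release", "cup"),
]
contact = [False, False, True, True, True, False]
grounded = ground_plan(plan, contact)
```

Steps with no matching contact event (such as "move" here) keep `frame=None`. In a fuller pipeline, the grounded frames would be where affordance information such as grasp type or waypoints is extracted and stored in `affordances`.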