Abstract: Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation; however, due to the prohibitive costs of real-world data collection and precise object simulation, it remains challenging for these approaches to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of this research focuses on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of an object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting the manipulation of various objects. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thoughts prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories demonstrate our framework's adaptability in practical scenarios. Code is released at <a class="link-external link-https" href="https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main" rel="external noopener nofollow">this https URL</a>.
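
To make the idea of a "unified textual description containing kinematic joints and contact location" concrete, here is a minimal sketch of how such a description might be assembled. It is not the released parser: the `Joint` and `ArticulatedObject` classes, the field names, and the output wording are all hypothetical, chosen only to illustrate turning joint and contact data into prompt text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Joint:
    """Hypothetical joint record; field names are illustrative assumptions."""
    name: str
    joint_type: str              # "revolute" or "prismatic"
    axis: Vec3                   # direction of the joint axis
    origin: Vec3                 # a point on the joint axis
    limits: Tuple[float, float]  # motion range (rad for revolute, m for prismatic)


@dataclass
class ArticulatedObject:
    """Hypothetical object record: a category, its joints, and one contact point."""
    category: str
    joints: List[Joint]
    contact_point: Vec3


def fmt(v: Vec3) -> str:
    """Format a 3D vector with two decimals so the text stays compact."""
    return "(" + ", ".join(f"{x:.2f}" for x in v) + ")"


def describe(obj: ArticulatedObject) -> str:
    """Render the object as plain text that could be placed in an LLM prompt."""
    lines = [f"Object category: {obj.category}"]
    for j in obj.joints:
        unit = "rad" if j.joint_type == "revolute" else "m"
        lines.append(
            f"Joint {j.name}: type={j.joint_type}, axis={fmt(j.axis)}, "
            f"origin={fmt(j.origin)}, "
            f"range=[{j.limits[0]:.2f}, {j.limits[1]:.2f}] {unit}"
        )
    lines.append(f"Contact location: {fmt(obj.contact_point)}")
    return "\n".join(lines)


# Example: a cabinet door hinged about a vertical axis (made-up numbers).
cabinet = ArticulatedObject(
    category="cabinet",
    joints=[Joint("door_hinge", "revolute", (0.0, 0.0, 1.0), (0.4, 0.3, 0.0), (0.0, 1.57))],
    contact_point=(0.4, -0.2, 0.5),
)
print(describe(cabinet))
```

The same text layout is used for every object, which is the point of a unified description: the prompt format does not change between a door, a drawer, or an unseen category.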

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to use large language models (LLMs) to achieve generalizable manipulation of articulated objects with very low data requirements, including zero-shot manipulation of unseen articulated object instances and categories. Traditional methods rely on large amounts of robotic data, but collecting such data in the real world is costly and simulating objects precisely is difficult. In addition, existing LLM-based methods mainly focus on high-level task planning and pay less attention to low-level robot control, which limits their ability to handle complex articulated object manipulation tasks. To address these problems, the authors propose a kinematic-aware prompting framework. The method extracts the kinematic knowledge of articulated objects and uses it to guide LLMs to generate accurate 3D manipulation waypoints, thereby achieving generalizable manipulation of articulated objects. Specifically, the main contributions of the paper include:

1. **Proposing the kinematic-aware prompting framework**: aiming to reduce the demand for robotic data and achieve generalizable manipulation of new articulated object instances and categories.
2. **Designing the unified kinematic knowledge parser and kinematic-aware planner components**: using the kinematic knowledge of objects to guide LLMs to generate accurate 3D manipulation waypoints.
3. **Extensive experimental verification**: experiments were carried out on 48 object instances across 16 categories, demonstrating the effectiveness of the framework for zero-shot articulated object manipulation and showing its generalization ability in practical scenarios.

Through this method, the paper not only improves the performance of LLMs in articulated object manipulation tasks but also reduces the dependence on a large amount of demonstration data, enabling robots to operate various articulated objects more flexibly in diverse environments.
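
The claim that an object's kinematic structure determines how it can be manipulated can be illustrated geometrically. The sketch below is not the paper's planner, in which the LLM generates the waypoints from the prompt. It only shows, under simple assumptions, what kinematically consistent waypoints look like: an arc about the joint axis for a revolute joint and a straight line along the axis for a prismatic joint. The function names and parameters are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def rotate_about_axis(point: Vec3, origin: Vec3, axis: Vec3, angle: float) -> Vec3:
    """Rotate `point` by `angle` radians about the line through `origin`
    with direction `axis`, using Rodrigues' rotation formula."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = tuple(a / norm for a in axis)                # unit axis
    v = tuple(p - o for p, o in zip(point, origin))  # point relative to the axis
    k_cross_v = (
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    )
    k_dot_v = sum(a * b for a, b in zip(k, v))
    c, s = math.cos(angle), math.sin(angle)
    rotated = tuple(
        v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c) for i in range(3)
    )
    return tuple(o + r for o, r in zip(origin, rotated))


def revolute_waypoints(contact: Vec3, origin: Vec3, axis: Vec3,
                       total_angle: float, steps: int) -> List[Vec3]:
    """Waypoints on the arc the contact point follows as a hinged part opens."""
    return [
        rotate_about_axis(contact, origin, axis, total_angle * i / steps)
        for i in range(1, steps + 1)
    ]


def prismatic_waypoints(contact: Vec3, axis: Vec3,
                        distance: float, steps: int) -> List[Vec3]:
    """Waypoints on the straight line the contact point follows as a sliding part moves."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = tuple(a / norm for a in axis)
    return [
        tuple(c + k_i * distance * i / steps for c, k_i in zip(contact, k))
        for i in range(1, steps + 1)
    ]


# Example: a door handle 1 m from a vertical hinge, opened by 90 degrees in 3 steps.
arc = revolute_waypoints((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2, 3)
# Example: a drawer handle pulled 0.3 m along the x axis in 3 steps.
line = prismatic_waypoints((0.0, 0.0, 0.5), (1.0, 0.0, 0.0), 0.3, 3)
```

Every waypoint on the arc stays at the same distance from the hinge axis, which is the constraint a planner must respect when it emits waypoints for a revolute joint.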

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs
