Abstract: Task-oriented grasping (TOG) refers to the problem of predicting grasps on an object that enable subsequent manipulation tasks. To model the complex relationships between objects, tasks, and grasps, existing methods incorporate semantic knowledge as priors into TOG pipelines. However, the existing semantic knowledge is typically constructed based on closed-world concept sets, restricting generalization to novel concepts outside the pre-defined sets. To address this issue, we propose GraspGPT, a large language model (LLM) based TOG framework that leverages the open-ended semantic knowledge from an LLM to achieve zero-shot generalization to novel concepts. We conduct experiments on the Language Augmented TaskGrasp (LA-TaskGrasp) dataset and demonstrate that GraspGPT outperforms existing TOG methods in different held-out settings when generalizing to novel concepts outside the training set. The effectiveness of GraspGPT is further validated in real-robot experiments. Our code, data, appendix, and video are publicly available at https://sites.google.com/view/graspgpt/.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses a key problem in task-oriented grasping (TOG): how to predict grasps that enable subsequent manipulation tasks when the instruction involves concepts unseen during training. Existing methods typically model the complex relationships between objects, tasks, and grasps by incorporating semantic knowledge as a prior in the TOG pipeline. However, this knowledge is usually built from a closed-world set of concepts, which limits generalization to new concepts.

To overcome this limitation, the paper proposes GraspGPT, a TOG framework based on a large language model (LLM). GraspGPT leverages the open-ended semantic knowledge in an LLM to achieve zero-shot generalization to novel concepts. Specifically, when a user provides a natural language instruction containing a novel concept, GraspGPT prompts the LLM to generate language description paragraphs for that concept, connecting it to related concepts described during training. This enables the robot to extend learned TOG skills from known concepts to new ones.

### Main Contributions

1. **Proposing GraspGPT**: An LLM-based TOG framework that uses open-ended semantic knowledge to achieve zero-shot generalization to novel concepts.
2. **Constructing the LA-TaskGrasp Dataset**: A language-augmented TOG dataset containing automatically generated language descriptions of concepts, used to evaluate GraspGPT.

### Experimental Validation

- **Perception Experiments**: Experiments on the LA-TaskGrasp dataset show that GraspGPT outperforms existing TOG methods in generalization under different held-out settings (e.g., unseen object categories and unseen tasks).
- **Real Robot Experiments**: GraspGPT is deployed on a Kinova Gen3 robotic arm to verify its effectiveness in real-world use. The results indicate that GraspGPT is effective at task-oriented grasping and tool manipulation.

### Conclusion

By leveraging the open-ended semantic knowledge in LLMs, GraspGPT addresses the generalization problem that existing methods face with novel concepts, providing a new approach to task-oriented grasping.
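
The prompting step described in the summary (asking an LLM for description paragraphs about the object and task named in an instruction) can be sketched roughly as follows. This is a minimal illustration, not GraspGPT's implementation: the `Instruction` class, the function names, the prompt wording, and the `query_llm` callable are all hypothetical, and the stub below stands in for a real LLM client.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Instruction:
    """Hypothetical parsed instruction, e.g. 'grasp the mug to pour'."""
    object_class: str
    task: str


def build_description_prompts(instruction: Instruction) -> Dict[str, str]:
    """Build one prompt per concept. The wording is illustrative only."""
    return {
        "object": (
            "Describe the shape, parts, and common uses of the following "
            f"object in one short paragraph: {instruction.object_class}"
        ),
        "task": (
            "Describe how a person typically holds an object to perform "
            f"the following task, in one short paragraph: {instruction.task}"
        ),
    }


def describe_concepts(
    instruction: Instruction, query_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Send each prompt to the supplied LLM callable and collect the replies."""
    prompts = build_description_prompts(instruction)
    return {name: query_llm(prompt) for name, prompt in prompts.items()}


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; it only echoes the prompt."""
    return f"[description for: {prompt}]"


descriptions = describe_concepts(Instruction("whisk", "stir"), stub_llm)
```

Passing the LLM in as a callable keeps the sketch independent of any particular provider. In a full pipeline the returned paragraphs would be encoded and combined with the object's geometry to score candidate grasps, a step that this sketch does not attempt.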

GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping

FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models

Lan-grasp: Using Large Language Models for Semantic Object Grasping

RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment

Multi-GraspLLM: A Multimodal LLM for Multi-Hand Semantic Guided Grasp Generation

ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition

Towards Open-World Grasping with Large Vision-Language Models

ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation

PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping

Decision-Making in Robotic Grasping with Large Language Models

Language-driven Grasp Detection

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding

VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes

GraspGF: Learning Score-based Grasping Primitive for Human-assisting Dexterous Grasping

Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions

Grasp as You Say: Language-guided Dexterous Grasp Generation

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Target-Oriented Object Grasping via Multimodal Human Guidance