Abstract: Robotic grasping is a fundamental ability for a robot to interact with its environment. Current methods focus on obtaining a stable and reliable grasp pose at the object level, while little work has studied part-wise (shape-wise) grasping, which is related to fine-grained grasping and robotic affordance. Parts can be seen as the atomic elements that compose an object; they carry rich semantic knowledge and correlate strongly with affordance. However, the lack of a large part-wise 3D robotic dataset limits the development of part representation learning and downstream applications. In this paper, we propose a new large Language-guided SHape grAsPing datasEt (named LangSHAPE) to promote learning of 3D part-level affordance and grasping ability. From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD), which includes a novel 3D part language grounding model and a part-aware grasp pose detection model. In this framework, explicit language input from a human or from large language models (LLMs) can guide a robot to generate part-level 6-DoF grasp poses with textual explanations. Our method combines the advantages of human-robot collaboration and the planning ability of LLMs, using explicit language as a symbolic intermediate. To evaluate the effectiveness of the proposed method, we perform 3D part grounding and fine-grained grasp detection experiments in both simulation and physical robot settings, following language instructions of varying textual complexity. Results show that our method achieves competitive performance in fine-grained 3D geometry grounding, object affordance inference, and 3D part-aware grasping tasks. Our dataset and code are available on our project website: https://sites.google.com/view/lang-shape

Hierarchical Multi-modal Fusion for Language-conditioned Robotic Grasping Detection in Clutter

LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion

Bilateral Cross-Modal Fusion Network for Robot Grasp Detection

A robot grasping detection network based on flexible selection of multi-modal feature fusion structure

A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter

Visual-tactile Fusion for Transparent Object Grasping in Complex Backgrounds

A novel integrated method of detection-grasping for specific object based on the box coordinate matching

Two-stage Grasp Detection Method for Robotics Using Point Clouds and Deep Hierarchical Feature Learning Network

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Language Guided Robotic Grasping with Fine-Grained Instructions

Robotic Grasp Detection Using Structure Prior Attention and Multiscale Features

Visual-and-Language Multimodal Fusion for Sweeping Robot Navigation Based on CNN and GRU

Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension

GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning

A YOLO-GGCNN based grasping framework for mobile robots in unknown environments

Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding

Efficient Fully Convolutional Network and Optimization Approach for Robotic Grasping Detection Based on RGB-D Images

Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions

Robotic Grasp Detection Network Based on Improved Deformable Convolution and Spatial Feature Center Mechanism

A neural learning approach for simultaneous object detection and grasp detection in cluttered scenes

Object Detection and Information Perception by Fusing YOLO-SCG and Point Cloud Clustering