Abstract: In the context of human-robot interaction and collaboration scenarios, robotic grasping still encounters numerous challenges. Traditional grasp detection methods generally analyze the entire scene to predict grasps, leading to redundancy and inefficiency. In this work, we reconsider 6-DoF grasp detection from a target-referenced perspective and propose a Target-Oriented Grasp Network (TOGNet). TOGNet specifically targets local, object-agnostic region patches to predict grasps more efficiently. It integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. Thus, our system comprises two primary functional modules: a guidance module that identifies the target object in 3D space, and TOGNet, which detects region-focal 6-DoF grasps around the target, facilitating subsequent motion planning. Through 50 target-grasping simulation experiments in cluttered scenes, our system achieves a success rate improvement of about 13.7%. In real-world experiments, we demonstrate that our method excels in various target-oriented grasping scenarios.

What problem does this paper attempt to address?

This paper addresses the challenges robots face when grasping objects in human-robot interaction and collaboration scenarios. Traditional grasp detection methods usually analyze the entire scene to predict grasps, which leads to redundancy and inefficiency. The paper reconsiders 6-degree-of-freedom (6-DoF) grasp detection from a target-referenced perspective and proposes a new model, the Target-Oriented Grasp Network (TOGNet). TOGNet focuses on local, object-agnostic region patches to predict grasps more efficiently, and it integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. In this way, the system can perform more accurate target-oriented grasping in cluttered environments.

### Main Problems

1. **Inefficiency of Traditional Grasp Detection Methods**: Traditional methods analyze the entire scene to predict grasps, which wastes computational resources and lowers efficiency.
2. **Insufficient Grasping Precision in Complex Environments**: In cluttered scenes, traditional methods struggle to generate high-quality grasps, especially when facing new scenes.
3. **Lack of Specificity for Specific Targets**: When grasping a specific target, existing methods often require additional computational steps, such as target segmentation and grasp filtering, which add unnecessary computational burden.

### Solutions

To solve the above problems, the paper proposes a new framework with the following main contributions:

1. **Constructing a Multimodal Guidance Pipeline**: The system integrates multiple state-of-the-art computer vision models that analyze and understand human intentions and efficiently crop target objects. This system is especially helpful for assisting people with visual, auditory, or motor disabilities.
2. **Designing the Target-Oriented Grasp Network (TOGNet)**: TOGNet detects 6-DoF grasp poses from the target-referenced region, thereby simplifying the robot's motion planning. TOGNet is trained on a fine-grained, region-focused dataset that is object-agnostic, so it remains effective when facing new scenes.
3. **Evaluating the System's Performance**: The system is evaluated on datasets, in simulation, and in real-world experiments. Specifically, a new evaluation metric is proposed for the target-oriented setting, and grasp quality is compared with recent state-of-the-art methods. In addition, the target-oriented grasping success rate is evaluated in 50 cluttered scenes on the ManiSkill2 benchmark, and the system's ability to understand and respond to multimodal human guidance is verified on a real robot platform.

### Formula Representation

The paper formalizes the grasp prediction process as follows. Assume that multiple RGB-D region patches are extracted by the guidance module, and the goal is to predict the 6-DoF grasp poses:

\[ G_p = \Phi(f_i \mid i = 1, \ldots, K) \]

where \( f_i \in \mathbb{R}^{N \times 3} \) is the \( i \)-th region patch cropped from the RGB-D image and centered at \( (x_p, y_p, z_p) \) in the camera coordinate system, as shown in Figure 3(A). \( \Phi(\cdot) \) represents TOGNet, which predicts the grasp pose \( g_p \in G_p \) centered at \( (x_p, y_p, z_p) \):

\[ g_p = (\Delta t, \theta, \beta, \gamma, w) \]

As shown in Figure 3(C), \( \theta, \beta, \gamma \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right] \) are the grasp Euler angles in the gripper coordinate system. \( \theta \) represents the rotation angle in the gripper plane, and \( \beta \) and \( \gamma \) represent the gripper direction.

To avoid collisions in cluttered environments, the grasp width \( w \) is also predicted. Because the guidance may be inaccurate, to improve robustness and accuracy, a 3D position offset \( \Delta t = (\Delta x, \Delta y, \Delta z) \in [-2 \ldots \)
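
The notation above can be made concrete with a small sketch. It is not the paper's implementation: the pinhole back-projection, the function name `backproject_patch`, the `Grasp` dataclass, the camera intrinsics (`fx`, `fy`, `cx`, `cy`), and the square pixel window are all assumptions made for illustration. The sketch shows how a depth patch around a guided pixel could become a point set like \( f_i \in \mathbb{R}^{N \times 3} \), and how a predicted offset \( \Delta t \) would shift the patch center \( (x_p, y_p, z_p) \). It also assumes \( \Delta t \) is expressed in the same units as the points, which the text does not state.

```python
from dataclasses import dataclass
import math

import numpy as np


def backproject_patch(depth, u, v, half_size, fx, fy, cx, cy):
    """Crop a square depth window around pixel (u, v) and back-project it
    to an (N, 3) array of camera-frame points (standard pinhole model).

    Hypothetical helper: the paper's actual cropping scheme is not given here.
    """
    h, w = depth.shape
    # Clamp the window to the image bounds.
    u0, u1 = max(u - half_size, 0), min(u + half_size + 1, w)
    v0, v1 = max(v - half_size, 0), min(v + half_size + 1, h)
    vs, us = np.mgrid[v0:v1, u0:u1]
    z = depth[v0:v1, u0:u1]
    valid = z > 0  # drop pixels with missing depth
    z = z[valid]
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


@dataclass
class Grasp:
    """Illustrative container for g_p = (delta_t, theta, beta, gamma, w)."""
    delta_t: np.ndarray  # 3D offset; assumed to share units with the points
    theta: float         # rotation angle in the gripper plane
    beta: float          # gripper direction angle
    gamma: float         # gripper direction angle
    width: float         # grasp width w

    def __post_init__(self):
        # The text restricts all three angles to [-pi/2, pi/2].
        for name in ("theta", "beta", "gamma"):
            angle = getattr(self, name)
            if not -math.pi / 2 <= angle <= math.pi / 2:
                raise ValueError(f"{name}={angle} is outside [-pi/2, pi/2]")

    def center(self, patch_center):
        """Return the grasp center: patch center shifted by delta_t."""
        return np.asarray(patch_center, dtype=float) + np.asarray(self.delta_t, dtype=float)


if __name__ == "__main__":
    depth = np.full((480, 640), 0.5)  # synthetic flat scene, 0.5 units away
    points = backproject_patch(depth, u=320, v=240, half_size=2,
                               fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    grasp = Grasp(delta_t=np.array([0.01, 0.0, -0.02]),
                  theta=0.3, beta=0.0, gamma=0.1, width=0.06)
    print(points.shape, grasp.center(points.mean(axis=0)))
```

The sketch deliberately stops short of converting \( (\theta, \beta, \gamma) \) into a rotation matrix, because the axis order of the Euler angles is not specified in the text above.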

Target-Oriented Object Grasping via Multimodal Human Guidance

RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment

Efficient Grasp Detection Network with Gaussian-Based Grasp Representation for Robotic Manipulation

Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping

UPG: 3D Vision-Based Prediction Framework for Robotic Grasping in Multi-Object Scenes.

Intelligent Grasping with Natural Human-Robot Interaction.

A Visual Detection and Grasping Method Based on Deep Learning

GoalGrasp: Grasping Goals in Partially Occluded Scenarios without Grasp Training

Antipodal-Points-aware Dual-decoding Network for Robotic Visual Grasp Detection Oriented to Multi-object Clutter Scenes

High Precision 6-DoF Grasp Detection in Cluttered Scenes Based on Network Optimization and Pose Propagation

Robotic Grasping in Multi-Object Stacking Scenes Based on Visual Reasoning

Target-referenced Reactive Grasping for Dynamic Objects

MPGNet: Learning Move-Push-Grasping Synergy for Target-Oriented Grasping in Occluded Scenes

MTGrasp: Multiscale 6-Dof Robotic Grasp Detection

Gated Self Attention Network for Efficient Grasping of Target Objects in Stacked Scenarios

Target Prediction and Temporal Localization of Grasping Action for Vision-Assisted Prosthetic Hand

Multitarget Flexible Grasping Detection Method for Robots in Unstructured Environments

A Real-Time Robotic Grasping Approach with Oriented Anchor Box

Learning 6-DoF Task-oriented Grasp Detection via Implicit Estimation and Visual Affordance

Robotic Continuous Grasping System by Shape Transformer-Guided Multi-Object Category-Level 6D Pose Estimation

A grasping posture estimation method based on 3D detection network