Target-Oriented Object Grasping via Multimodal Human Guidance

Pengwei Xie, Siang Chen, Dingchang Hu, Yixiang Dai, Kaiqin Yang, Guijin Wang
2024-08-21
Abstract: In the context of human-robot interaction and collaboration scenarios, robotic grasping still encounters numerous challenges. Traditional grasp detection methods generally analyze the entire scene to predict grasps, leading to redundancy and inefficiency. In this work, we reconsider 6-DoF grasp detection from a target-referenced perspective and propose a Target-Oriented Grasp Network (TOGNet). TOGNet specifically targets local, object-agnostic region patches to predict grasps more efficiently. It integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. Thus our system comprises two primary functional modules: a guidance module that identifies the target object in 3D space and TOGNet, which detects region-focal 6-DoF grasps around the target, facilitating subsequent motion planning. Through 50 target-grasping simulation experiments in cluttered scenes, our system achieves a success rate improvement of about 13.7%. In real-world experiments, we demonstrate that our method excels in various target-oriented grasping scenarios.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges robots face when grasping objects in human-robot interaction and collaboration scenarios. Traditional grasp detection methods usually analyze the entire scene to predict grasps, which leads to redundancy and inefficiency. The paper reconsiders 6-degree-of-freedom (6-DoF) grasp detection from a target-referenced perspective and proposes a new model, the Target-Oriented Grasp Network (TOGNet). TOGNet focuses on local, object-agnostic region patches to predict grasps more efficiently, and it integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. In this way, the system can perform more accurate target-oriented grasping in cluttered environments.

### Main Problems

1. **Inefficiency of Traditional Grasp Detection Methods**: Traditional methods analyze the entire scene to predict grasps, which wastes computational resources and lowers efficiency.
2. **Insufficient Grasping Precision in Complex Environments**: In cluttered scenes, traditional methods struggle to generate high-quality grasps, especially when facing new scenes.
3. **Lack of Specificity for Specific Targets**: When a specific target must be grasped, existing methods often require additional computational steps, such as target segmentation and grasp filtering, which add unnecessary computational burden.

### Solutions

To solve these problems, the paper proposes a new framework with the following main contributions:

1. **Constructing a Multimodal Guidance Pipeline**: The system integrates multiple state-of-the-art computer vision models that analyze and understand human intentions and efficiently crop target objects. This system is especially helpful for assisting people with visual, auditory, or motor disabilities.
2. **Designing the Target-Oriented Grasp Network (TOGNet)**: TOGNet detects 6-DoF grasp poses from the target-referenced region, thereby simplifying the robot's motion planning. TOGNet is trained on a fine-grained, region-focused dataset that is object-agnostic, so it remains effective when facing new scenes.
3. **Evaluating the System's Performance**: The system is evaluated on datasets, in simulation, and in real-world experiments. Specifically, a new evaluation metric is proposed to suit the target-oriented setting, and grasp quality is compared with recent state-of-the-art methods. In addition, the target-oriented grasping success rate is evaluated in 50 cluttered scenes on the ManiSkill2 benchmark, and the system's ability to understand and respond to multimodal human guidance is verified on a real robot platform.

### Formula Representation

The paper expresses grasp prediction with a few formulas. Assume that multiple RGB-D region patches are extracted by the guidance module, and the goal is to predict the 6-DoF grasp poses:

\[ G_p = \Phi(f_i \mid i = 1, \ldots, K) \]

where \( f_i \in \mathbb{R}^{N \times 3} \) is the \( i \)-th region patch cropped from the RGB-D image and centered at \( (x_p, y_p, z_p) \) in the camera coordinate system, as shown in Figure 3(A), and \( \Phi(\cdot) \) denotes TOGNet, which predicts each grasp pose \( g_p \in G_p \) centered at \( (x_p, y_p, z_p) \):

\[ g_p = (\Delta t, \theta, \beta, \gamma, w) \]

As shown in Figure 3(C), \( (\theta, \beta, \gamma) \in [-\frac{\pi}{2}, \frac{\pi}{2}] \) are the grasp Euler angles in the gripper coordinate system: \( \theta \) is the rotation angle in the gripper plane, and \( \beta \) and \( \gamma \) represent the gripper directions.
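
As a rough illustration of how one prediction \( g_p = (\Delta t, \theta, \beta, \gamma, w) \) could be turned into a gripper pose, the sketch below adds the offset \( \Delta t \) to the patch center \( (x_p, y_p, z_p) \) and builds a rotation matrix from the three Euler angles. Almost everything beyond the symbols themselves is an assumption made for illustration: the `Grasp` container and the `grasp_to_pose` helper are hypothetical names, the rotation order `Rz(theta) @ Ry(beta) @ Rx(gamma)` is arbitrary because this summary does not state the paper's Euler convention or axis assignment, and the offset is assumed to use the same units as the patch center.

```python
import math
from dataclasses import dataclass


@dataclass
class Grasp:
    """Hypothetical container for one prediction g_p = (delta_t, theta, beta, gamma, w)."""
    delta_t: tuple  # (dx, dy, dz) offset from the patch center (units assumed to match the center)
    theta: float    # in-plane rotation angle, radians
    beta: float     # gripper direction angle, radians
    gamma: float    # gripper direction angle, radians
    width: float    # predicted gripper opening width


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def grasp_to_pose(center, grasp):
    """Return (rotation, translation) for a grasp around a patch center.

    ASSUMPTION: rotation = Rz(theta) @ Ry(beta) @ Rx(gamma). The paper's
    actual Euler convention is not given here and may differ.
    """
    half_pi = math.pi / 2
    for angle in (grasp.theta, grasp.beta, grasp.gamma):
        # The angles are stated to lie in [-pi/2, pi/2].
        if not -half_pi <= angle <= half_pi:
            raise ValueError("Euler angle outside [-pi/2, pi/2]")
    rotation = matmul3(rot_z(grasp.theta),
                       matmul3(rot_y(grasp.beta), rot_x(grasp.gamma)))
    # Shift the guided center by the predicted 3D offset.
    translation = tuple(c + d for c, d in zip(center, grasp.delta_t))
    return rotation, translation


# Example with made-up numbers: a quarter-turn in the gripper plane.
example = Grasp(delta_t=(0.01, 0.0, -0.02), theta=math.pi / 2,
                beta=0.0, gamma=0.0, width=0.05)
R, t = grasp_to_pose((0.1, 0.2, 0.5), example)
```

The resulting rotation and translation could be handed to a motion planner as a 4x4 homogeneous transform. The width is carried along unchanged, since it only sets the gripper opening.
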
To avoid collisions in cluttered environments, the grasp width \( w \) is also predicted. Because the guidance may be inaccurate, to improve robustness and accuracy, a 3D position offset \( \Delta t=(\Delta x,\Delta y,\Delta z)\in[-2 \]