Gated Self Attention Network for Efficient Grasping of Target Objects in Stacked Scenarios

Yuxuan Wang, Yuru Gong, Jianhua Wu, Zhenhua Xiong
DOI: https://doi.org/10.1109/rcar61438.2024.10671227
2024-01-01
Abstract: Grasping in stacked scenarios is an indispensable capability for intelligent robots. However, in multi-object stacking or occluded scenes, existing algorithms that grasp the target object directly have a high failure rate, while methods that clear the scene before grasping are inefficient. Hence, grasping operations for target objects must be executed in a logical sequence. To address this challenge, we propose an end-to-end grasping model based on a Gated Self Attention Network (GSAN), designed to guide robots to perform optimal sequential grasping of target objects within dense and cluttered scenes. We integrate object detection, grasp detection, and stacking relationship reasoning into a single deep neural network. Specifically, the object detection and grasp detection networks extract features from input RGB images and estimate object categories, bounding boxes, and grasp poses. The GSAN captures the non-Euclidean information between object features in high-dimensional space, enhancing the accuracy of triplet relationship reasoning through gated self attention and positional encoding. Our algorithm achieves the best results on the Visual Manipulation Relationship Dataset (VMRD), with an OP of 92.07%, an OR of 91.67%, and an IA of 81.67%, and extensive ablation studies confirm the necessity of each component of our method. As the first end-to-end grasping framework to incorporate self attention into the relationship reasoning module, our proposed method enhances the logical capabilities of robots, enabling efficient grasping operations in complex and dynamic scenes and fostering human-robot collaboration.
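The abstract does not give the equations of the GSAN, so the following is only a rough sketch of what gated self attention over per-object features with a box-based positional encoding might look like. Everything here is an assumption made for illustration: the function names, the sinusoidal encoding of bounding boxes, the sigmoid gate that blends each object's attended context with its own feature, and the random weights standing in for learned parameters. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def box_positional_encoding(boxes, dim):
    # Hypothetical encoding: sinusoidal features of normalized
    # [x1, y1, x2, y2] box coordinates. dim must be divisible by 8.
    n = boxes.shape[0]
    n_freq = dim // 8
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))
    angles = boxes[:, :, None] * freqs[None, None, :]      # (n, 4, dim/8)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(n, dim)                              # (n, dim)


def init_params(dim, seed=0):
    # Random stand-ins for learned projection and gate weights.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return {
        "Wq": rng.normal(0.0, scale, (dim, dim)),
        "Wk": rng.normal(0.0, scale, (dim, dim)),
        "Wv": rng.normal(0.0, scale, (dim, dim)),
        "Wg": rng.normal(0.0, scale, (2 * dim, dim)),
    }


def gated_self_attention(feats, boxes, params):
    # feats: (n, dim) per-object features; boxes: (n, 4) normalized boxes.
    n, dim = feats.shape
    x = feats + box_positional_encoding(boxes, dim)

    q, k, v = x @ params["Wq"], x @ params["Wk"], x @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)        # (n, n)
    context = attn @ v                                      # (n, dim)

    # Assumed gate: decides per channel how much attended context
    # replaces the object's own (position-augmented) feature.
    gate = sigmoid(np.concatenate([x, context], axis=-1) @ params["Wg"])
    out = gate * context + (1.0 - gate) * x
    return out, attn, gate
```

In a pipeline like the one described, the refined per-object features would then be paired up and passed to a relationship classifier; that stage is omitted here because the abstract does not specify it.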