Hierarchical Multi-modal Fusion for Language-conditioned Robotic Grasping Detection in Clutter

Jin Liu, Jialong Xie, Leibing Xiao, Chaoqun Wang, Fengyu Zhou
DOI: https://doi.org/10.1109/lra.2024.3440833
2024-01-01
Abstract: This letter addresses the challenging task of language-conditioned grasping detection in clutter, where grasping postures must be generated for a robot according to complicated human instructions. Existing methods typically employ well-trained object detectors and leverage language similarities to localize a single object, and then generate grasping postures for the identified object. However, such sequential approaches can accumulate errors and, because they cannot comprehend sentence logic, produce inconsistent results between object localization and grasping detection. In this letter, we propose an end-to-end network for object localization and grasping detection to tackle these challenges. Specifically, we first extract salient objects and spatial relationships from the sentence logic and employ a Contrastive Language-Image Pre-training (CLIP; Radford et al., 2021) model to acquire both visual and textual features. Subsequently, we introduce separate vision-language fusion modules that conduct multi-modal fusion at the object, spatial, and global levels, respectively. With the obtained multi-modal features, we further design a hierarchical feature modeling mechanism that integrates the fused features to achieve simultaneous object localization and accurate grasping detection. Extensive experiments on a real-world dataset and in robotic applications demonstrate the effectiveness and accuracy of the proposed method.
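
To make the idea of fusing at the object, spatial, and global levels more concrete, here is a minimal sketch in plain Python. It is not the authors' network. It assumes that each fusion module can be approximated by single-head cross-attention with a residual connection, and that the hierarchical integration can be approximated by averaging the three fused outputs. The function names (`cross_attend`, `hierarchical_fuse`) and the toy 2-D features are hypothetical stand-ins for CLIP embeddings.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(visual, text):
    """Each visual token (query) attends over the text tokens (keys and values)."""
    dim = len(text[0])
    fused = []
    for q in visual:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text]
        weights = softmax(scores)
        context = [sum(w * t[i] for w, t in zip(weights, text)) for i in range(dim)]
        # Residual connection keeps the original visual information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

def hierarchical_fuse(visual, object_text, spatial_text, global_text):
    """Fuse visual tokens with three groups of text tokens, then average them.

    Averaging is an assumption made for illustration; the abstract does not
    specify how the hierarchical mechanism combines the three levels.
    """
    levels = [cross_attend(visual, t) for t in (object_text, spatial_text, global_text)]
    n_levels = len(levels)
    dim = len(visual[0])
    return [
        [sum(level[p][i] for level in levels) / n_levels for i in range(dim)]
        for p in range(len(visual))
    ]

# Toy 2-D features; a real system would take these from CLIP's encoders.
visual = [[1.0, 0.0], [0.0, 1.0]]        # two image patches
object_text = [[1.0, 0.0]]               # e.g. an embedding for "red cup"
spatial_text = [[0.0, 1.0]]              # e.g. an embedding for "left of"
global_text = [[1.0, 0.0], [0.0, 1.0]]   # embeddings for the whole sentence
fused = hierarchical_fuse(visual, object_text, spatial_text, global_text)
```

In this sketch the same visual tokens are queried against three different text groupings, so each patch ends up carrying object, relation, and sentence-level evidence. A localization head and a grasp head could then share the fused features, which is one way to read the abstract's claim of simultaneous localization and grasp prediction.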