Gaze-assisted visual grounding via knowledge distillation for referred object grasping with under-specified object referring

Zhuoyang Zhang, Kun Qian, Bo Zhou, Fang Fang, Xudong Ma
DOI: https://doi.org/10.1016/j.engappai.2024.108493
IF: 8
2024-05-04
Engineering Applications of Artificial Intelligence
Abstract: Understanding human referring intention is the key challenge in referred object grasping tasks. Current Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) methods enable robots to ground the referred object through precise language commands while ignoring the under-specified referring expressions in practical scenarios. In this paper, we propose a gaze-assisted visual grounding network, which efficiently estimates the human gaze via knowledge distillation techniques to provide visual guidance for inferring human referring intention. Specifically, a gaze auxiliary network is introduced to incorporate the human gaze to implicitly eliminate the referring ambiguity. A cross-attention-based multi-modal fusion module is designed with a balanced structure to project the diverse modalities into a common feature space to balance the inter-modal information. A tabletop object dataset with human gaze and under-specified referring expressions was established to evaluate our method. Experimental results show that our method achieves advanced results compared to the state-of-the-art visual grounding method, with a gain of 8.23% on REC precision and 10.44% on RES mIoU (mean Intersection over Union). The effectiveness and superiority of our method are also demonstrated through real-world referred object grasping experiments. The code is now available at https://github.com/robotseu/GVGNet_KD
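The two metrics quoted in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes the common conventions that an REC prediction counts as correct when its box IoU with the ground truth is at least 0.5, and that RES mIoU is the mean of per-sample mask IoU. The abstract states neither convention, and all function names below are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format
Mask = Sequence[Sequence[int]]           # binary mask as nested 0/1 rows


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over Union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union > 0 else 1.0


def rec_precision(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth is >= thr."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= thr)
    return hits / len(preds)


def res_miou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean of per-sample mask IoU (one common reading of 'mIoU')."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)
```

Under these assumptions, a reported gain in REC precision means more predicted boxes clear the IoU threshold, while a gain in RES mIoU means the predicted masks overlap the ground truth more closely on average.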
automation & control systems; computer science, artificial intelligence; engineering, electrical & electronic; engineering, multidisciplinary
What problem does this paper attempt to address?