GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

Kun Qian, Zhuoyang Zhang, Wei Song, Jianfeng Liao
DOI: https://doi.org/10.1109/lra.2023.3301294
2023-01-01
Abstract: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) enable robots to infer a human's object referring intention from natural language. In this letter, the Gaze-directed Visual Grounding Network (GVGNet) is proposed to disambiguate a human's under-specified object referring intention in the joint task of REC and RES. To reduce uncertainty about the referred object, human gaze is first introduced to explicitly indicate the target location. A multi-modal feature fusion module is then applied to model the context among the language, image, and gaze modalities for the subsequent localization and segmentation modules. To train the network, a tabletop object dataset with human gaze, named OCID-underRef, is extended from the existing OCID-Ref dataset through synthetic gaze modeling. The domain gap between the synthetic and real-world gaze distributions is further reduced through a gaze rectification method that uses dynamic grid assignment and a head pose constraint. Experimental results validate the effectiveness of the method both on OCID-underRef and in real-world scenarios for referred object grasping tasks with under-specified object referring expressions.