YOUREFIT: EMBODIED REFERENCE UNDERSTAND-

Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
2021-01-01
Abstract: We study machine understanding of embodied reference: one agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Notably, this new visual task requires understanding multimodal information with visual perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced, large-scale real-world dataset of embodied reference; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in real-world scenes for understanding referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Our results provide overwhelming evidence that gestural information is as critical as language information in understanding embodied reference, indicating the significance of incorporating gestures for visual scene understanding.