AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

Hao Guo, Wei Fan, Baichun Wei, Jianfei Zhu, Jin Tian, Chunzhi Yi, Feng Jiang
2024-11-13
Abstract: Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture-represented line. Extensive experiments on the YouRefIt dataset demonstrate the efficacy of our gesture information understanding method in significantly improving task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how an agent can more accurately locate a target object when interpreting human intention from gesture signals and language descriptions. Specifically, the authors focus on reducing the misinterpretation of gestures, pointing gestures in particular, across different interaction scenarios.

### Problem Background

Traditional methods for understanding pointing gestures mainly rely on extending the arm-finger line, as seen from the observer's perspective, to locate the target object. However, this approach suffers from systematic spatial misinterpretation, and it is especially prone to failure in close-range interactions. For example, when the pointer indicates an object close to themselves, the Virtual Touch Line (VTL) and its extension may not intersect the target object. In addition, because the pointer's arm may be bent arbitrarily, the Finger Line (FL) can indicate the spatial position of the target object more clearly in such cases.

### The Method Proposed in the Paper

To address these problems, the authors propose the **Attention-Dynamic DINO (AD-DINO)** framework. Its main innovations are:

1. **Distance-Aware Visual Perspective Taking (DA-VPT) mechanism**: The attention source is adjusted dynamically according to the distance between the pointer and the target object. For long-distance interactions, the attention source is the eyes; for close-range interactions, it is the metacarpophalangeal (MCP) joint of the index finger. This yields a more accurate representation of the pointing gesture.
2. **Attention-Dynamic Touch Line (ADTL)**: The finger position and the attention source are combined into a dynamic touch line that represents the pointing gesture more accurately.
3. **Multi-modal fusion model**: Visual and text features are combined and enhanced by a cross-modal fusion module, then decoded with a language-guided query selection module. The model outputs the bounding box of the target object and the attention source.
4. **Independent prediction of the attention source**: Predicting the attention source on its own lowers training cost and error, and is more efficient than simultaneously predicting attention source and finger position pairs.

### Experimental Results

On the YouRefIt dataset, AD-DINO reaches 76.4% accuracy at the 0.25 IoU threshold. At the 0.75 IoU threshold it reaches 55.4% accuracy, the first time a computational model has surpassed human performance on the embodied reference understanding task.

### Summary

By introducing a distance-aware mechanism and a dynamic touch line, this paper improves an agent's ability to understand pointing gestures and mitigates the misinterpretation that traditional methods exhibit across different interaction scenarios. It represents notable progress in embodied reference understanding.
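
As a rough illustration of the attention-dynamic touch line described above, the following Python sketch picks the attention source by interaction range and scores how well the resulting line points at a target. It is not the authors' implementation: the `Pose` fields, the `is_close_range` flag, the cosine-based `alignment` score, and the toy coordinates are all assumptions made for illustration. In AD-DINO itself the attention source is predicted by the model rather than supplied as a flag.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]  # 2D image coordinates (x, y)


@dataclass
class Pose:
    """Hypothetical keypoints of the person who is pointing."""
    eye: Point        # used as attention source for long-distance pointing
    mcp: Point        # index-finger MCP joint, used for close-range pointing
    fingertip: Point  # index fingertip, the end of the touch line


def attention_source(pose: Pose, is_close_range: bool) -> Point:
    """Select the attention source: MCP joint if close range, eyes otherwise."""
    return pose.mcp if is_close_range else pose.eye


def unit_direction(start: Point, end: Point) -> Point:
    """Unit vector pointing from start to end."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("start and end points coincide")
    return (dx / norm, dy / norm)


def alignment(source: Point, fingertip: Point, target_center: Point) -> float:
    """Cosine between the touch line (source -> fingertip) and the
    fingertip -> target direction; 1.0 means the line points at the target."""
    lx, ly = unit_direction(source, fingertip)
    tx, ty = unit_direction(fingertip, target_center)
    return lx * tx + ly * ty


# Toy close-range scene: the target sits just beyond the fingertip.
pose = Pose(eye=(0.0, 0.0), mcp=(2.0, 3.0), fingertip=(2.0, 4.0))
target = (2.0, 6.0)

adtl_score = alignment(attention_source(pose, True), pose.fingertip, target)
vtl_score = alignment(attention_source(pose, False), pose.fingertip, target)
print(f"MCP-based line: {adtl_score:.3f}, eye-based line: {vtl_score:.3f}")
```

In this toy scene the MCP-based line points straight at the target while the eye-based line is offset, mirroring the close-range failure case described in the problem background.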