Abstract: Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture-represented line. Extensive experiments on the YouRefIt dataset demonstrate the efficacy of our gesture information understanding method in significantly improving task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
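
The accuracy figures quoted in the abstract are reported at IoU (intersection over union) thresholds. As a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corners and that a prediction counts as correct when its IoU with the ground-truth box meets or exceeds the threshold (the exact comparison used in the YouRefIt evaluation is not stated here), the metric could be computed as follows. The names `iou` and `accuracy_at` are illustrative and not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the ground truth is >= threshold.

    Assumption: ">=" is used for the comparison; an evaluation script may use ">".
    """
    if not ground_truths or len(predictions) != len(ground_truths):
        raise ValueError("need equally sized, non-empty box lists")
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


# Toy data: the first prediction is exact, the second overlaps by one third.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
truths = [(0, 0, 10, 10), (5, 0, 15, 10)]
loose = accuracy_at(preds, truths, 0.25)   # both predictions count
strict = accuracy_at(preds, truths, 0.75)  # only the exact one counts
```

The stricter threshold rewards tight localization, which is why the 0.75 result is singled out separately from the 0.25 result.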

What problem does this paper attempt to address?

The problem this paper addresses is how an agent can more accurately predict the location of a target object when interpreting human intention through gesture signals and language descriptions. Specifically, the authors focus on reducing the misinterpretation of gestures, especially pointing gestures, across different interaction scenarios.

### Problem Background

Traditional methods for understanding pointing gestures mainly rely on extending the arm-finger line, as seen from the observer's perspective, to locate the target object. This approach produces systematic spatial misinterpretations and can fail in close-range interactions. For example, when the pointer indicates an object close to themselves, the Virtual Touch Line (VTL) and its extension may not intersect the target object. In addition, the pointer's limb may be bent arbitrarily, in which case the Finger Line (FL) represents the spatial position of the target object more clearly.

### The Method Proposed in the Paper

To solve these problems, the authors propose the **Attention-Dynamic DINO (AD-DINO)** framework. Its main innovations are:

1. **Distance-Aware Visual Perspective Taking (DA-VPT) mechanism**: The attention source is adjusted dynamically according to the distance between the pointer and the target object. For long-distance interactions, the attention source is the eyes; for close-range interactions, it is the metacarpophalangeal (MCP) joint of the index finger, which yields a more accurate pointing representation.
2. **Attention-Dynamic Touch Line (ADTL)**: The finger position and the attention source are combined into a dynamic touch line that represents the pointing gesture more accurately.
3. **Multi-modal Fusion Model**: Visual and text features are combined, enhanced by a cross-modal fusion module, and decoded through a language-guided query selection module. The model outputs the bounding box of the target object and the attention source.
4. **Independent Prediction of the Attention Source**: Predicting the attention source on its own reduces training cost and error, and is more efficient than predicting attention source and finger position pairs simultaneously.

### Experimental Results

The experimental results show that AD-DINO achieves a significant performance improvement on the YouRefIt dataset. In particular, at the 0.75 IoU threshold it reaches an accuracy of 55.4%, the first time a computational model has surpassed human performance on the embodied reference understanding task.

### Summary

By introducing the distance-aware mechanism and the dynamic touch line, this paper significantly improves an agent's ability to understand pointing gestures and addresses the misinterpretation problems that traditional methods exhibit across interaction scenarios. It represents important progress in the field of embodied reference understanding.
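
The distance-dependent choice of attention source described above can be illustrated with a small geometric sketch. This is not the paper's implementation: the summary describes a model that predicts the attention source, whereas the code below picks it with a hand-set rule. The 2D keypoints, the `near_threshold` parameter, the use of fingertip-to-target distance as the "interaction distance", and the cosine-based `alignment` score are all assumptions made for illustration.

```python
import math


def touch_line_direction(source, fingertip):
    """Unit direction of the line from the attention source through the fingertip."""
    dx, dy = fingertip[0] - source[0], fingertip[1] - source[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("attention source and fingertip coincide")
    return dx / norm, dy / norm


def select_attention_source(eye, mcp, fingertip, target, near_threshold):
    """Hypothetical rule: use the MCP joint for near targets, the eye otherwise."""
    distance = math.dist(fingertip, target)
    return mcp if distance < near_threshold else eye


def alignment(source, fingertip, target):
    """Cosine between the touch line and the source-to-target direction.

    A value of 1.0 means the line passes exactly through the target point.
    """
    ux, uy = touch_line_direction(source, fingertip)
    tx, ty = target[0] - source[0], target[1] - source[1]
    norm = math.hypot(tx, ty)
    if norm == 0:
        raise ValueError("attention source and target coincide")
    return (ux * tx + uy * ty) / norm


# Toy image-plane keypoints: the finger points straight down.
eye, mcp, fingertip = (0.0, 0.0), (3.0, 4.0), (3.0, 5.0)
near_target = (3.0, 7.0)    # just below the fingertip
far_target = (30.0, 50.0)   # far away, on the eye-fingertip line

near_source = select_attention_source(eye, mcp, fingertip, near_target, near_threshold=3.0)
far_source = select_attention_source(eye, mcp, fingertip, far_target, near_threshold=3.0)
```

In this toy setup the eye-to-fingertip line misses the nearby target slightly, while the MCP-to-fingertip line passes through it, which mirrors the close-range failure case described in the problem background.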

AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
