Abstract: As a subtask of visual grounding (VG), referring expression comprehension (REC) aims to locate, in a given image, the object described by an input referring expression. As a task built on multimodal data, REC helps artificial intelligence (AI) systems support interaction among humans, machines, and the physical world. It can be applied in visual understanding systems and dialogue systems for domains such as navigation, autonomous driving, robotics, and early education. It also benefits related research, including 1) image retrieval, 2) image captioning, and 3) visual question answering. Over the past two decades, object detection in computer vision has advanced dramatically and can now locate all objects that belong to a predefined, fixed set of categories. REC is more challenging: to find the one object defined by the referring expression, the model must reason over multiple objects. The general REC pipeline can be divided into three modules: linguistic feature extraction, visual feature extraction, and visual-linguistic fusion. The most important of the three is visual-linguistic fusion, which realizes the interaction and screening between linguistic and visual features. In addition, current research is increasingly oriented toward the design of the visual feature extraction module, which is to some extent the foundation of an REC model. Visual input carries richer information than text input, so it also contains more redundant information that must be suppressed. The potential for accurate object localization therefore depends on extracting effective visual features. We divide existing REC methods into three categories.
1) Methods with region-level convolutional visual representations. According to how the visual-linguistic fusion module is modeled, these methods can be divided into five subcategories: (1) early fusion, (2) attention-based fusion, (3) expression-decomposition fusion, (4) graph-network fusion, and (5) Transformer-based fusion. Because object proposals must be generated for the input image in advance, these methods suffer from high computational cost and low speed. Moreover, the performance of the REC model is limited by the quality of the object proposals.
2) Methods with grid-level convolutional visual representations. Their multimodal fusion modules fall into two categories: (1) filtering-based fusion and (2) Transformer-based fusion. Because no object proposals need to be generated, model inference can be at least 10 times faster.
3) Methods with image-patch-level visual representations. The two kinds of methods above use pre-trained object detection networks or convolutional networks as visual feature extractors, so the extracted visual features may not match the visual elements that REC requires. More research therefore focuses on integrating the visual feature extraction module with the visual-linguistic fusion module, taking the pixels of image patches as input. Instead of relying on a pre-trained convolutional neural network (CNN) visual feature extractor, these methods generate visual features directly under the guidance of the text input, so that the features meet the requirements of the REC task.
The REC task is then introduced through four popular datasets and the corresponding evaluation methods. Furthermore, three kinds of challenges in current REC research remain to be resolved: 1) the model's inference speed, 2) the interpretability of the model, and 3) the model's ability to reason about expressions.
Finally, future research directions for REC, including its extension to the video and 3D domains, are predicted and analyzed from two aspects: model design and domain development.
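The abstract mentions evaluation methods without naming a metric. As a minimal sketch, a convention commonly used in REC work is to count a prediction as correct when the intersection over union (IoU) between the predicted box and the ground-truth box reaches a threshold, often 0.5. The corner box format `(x1, y1, x2, y2)`, the 0.5 default, and the function names `iou` and `rec_accuracy` are assumptions made for illustration, not details taken from the review.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: List[Box], targets: List[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of expressions whose predicted box has IoU >= threshold.

    The 0.5 default is an assumed, commonly used threshold.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(predictions, ground_truth))
```

In this usage example the first prediction matches its target exactly, while the second overlaps its target with an IoU of 1/7, so only one of the two expressions counts as correctly grounded.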

Multimodal Referring Expression Comprehension Based on Image and Text: A Review
