Abstract: Referring Expression Comprehension (REC) is an important task in the vision-and-language community, since it is an essential step for many cross-modal tasks such as VQA, image retrieval and image captioning. To obtain a better trade-off between speed and accuracy, existing research usually follows a one-stage paradigm, in which the task is treated as language-conditioned object detection. Previous one-stage REC frameworks explore many different research perspectives, such as the fusion strategy, the stage at which fusion occurs and the design of the detection head. Surprisingly, these works mostly ignore the value of integrating multi-level features, and some apply only single-scale features to locate the target. In this paper, we focus on rethinking and improving feature pyramids for one-stage REC. Through experimental validation, we first show that although multi-scale fusion is an effective approach for improving performance, the mature neck structures from object detection (e.g., FPN, BFN and HRFPN) have a limited impact on this task. Further, we visualize the outputs of FPN and find that the underlying reason is that these coarse-grained FPN fusion strategies suffer from a semantic ambiguity problem. Based on these insights, we propose a new Language-Guided FPN (LG-FPN) method, which dynamically allocates and selects fine-grained information by stacking language-gates and union-gates. Extensive comparative and ablation experiments show that LG-FPN is an effective and reliable module that adapts to different visual backbones, fusion strategies and detection heads. Finally, our method achieves state-of-the-art performance on four referring expression datasets.
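The abstract does not specify how the language-gate or union-gate is built, so the following is only a minimal sketch of the general idea of language-conditioned gating in a top-down feature pyramid, not the LG-FPN design itself. It assumes a per-channel sigmoid gate computed from a text embedding, one-dimensional "spatial" maps, and nearest-neighbour upsampling; all function names and the weight layout are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def language_gate(lang_vec, weights):
    """Per-channel gate in (0, 1): sigmoid of a linear projection of the
    text embedding. `weights` has one row per visual channel (assumed)."""
    return [sigmoid(sum(w * l for w, l in zip(row, lang_vec))) for row in weights]


def apply_gate(level, gate):
    """Scale the channels of every spatial position by the gate."""
    return [[g * v for g, v in zip(gate, pos)] for pos in level]


def upsample2(level):
    """Nearest-neighbour upsampling by a factor of 2 along a 1-D spatial axis."""
    return [list(pos) for pos in level for _ in range(2)]


def gated_top_down(levels, lang_vec, gate_weights):
    """Gate each pyramid level with the text embedding, then fuse top-down.

    levels: list ordered fine to coarse; each level is [positions][channels],
            and each coarser level has half as many positions.
    gate_weights: one hypothetical projection matrix per level.
    """
    gated = [
        apply_gate(lv, language_gate(lang_vec, w))
        for lv, w in zip(levels, gate_weights)
    ]
    fused = [gated[-1]]  # start from the coarsest level
    for lv in reversed(gated[:-1]):
        up = upsample2(fused[0])
        merged = [[a + b for a, b in zip(p, q)] for p, q in zip(lv, up)]
        fused.insert(0, merged)
    return fused
```

With all-zero projection weights every gate equals 0.5, so each level is simply halved before the usual top-down sum; non-zero weights let the expression suppress or emphasise individual channels at each scale, which is the kind of fine-grained selection the abstract contrasts with coarse-grained FPN fusion.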

Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension

A Multi-Scale Language Embedding Network for Proposal-Free Referring Expression Comprehension

Rethinking and Improving Feature Pyramids for One-Stage Referring Expression Comprehension

A Real-time Global Inference Network for One-stage Referring Expression Comprehension

A Graph-Theoretic Fusion Framework for Unsupervised Entity Resolution

LGR-NET: Language Guided Reasoning Network for Referring Expression Comprehension

Performance of representation fusion model for entity and relationship extraction within unstructured text

Representation iterative fusion based on heterogeneous graph neural network for joint entity and relation extraction

Coarse-to-Fine Entity Representations for Document-level Relation Extraction

What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study

Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding

One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning

FSN: Joint Entity and Relation Extraction Based on Filter Separator Network

SINet: Improving relational features in two-stage referring expression comprehension

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Multi-Encoder with Entity-Aware Embedding Framework for Distantly Supervised Relation Extraction

Fine-grained Facial Expression Recognition Via Relational Reasoning and Hierarchical Relation Optimization

Relationship-Embedded Representation Learning for Grounding Referring Expressions

Referring Segmentation Via Encoder-Fused Cross-Modal Attention Network

Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction