Abstract: Fine-grained image-text retrieval is an active research topic that bridges vision and language, and its main challenge is learning the semantic correspondence across modalities. Existing methods mainly focus on learning the global semantic correspondence or the intramodal relation correspondence in separate data representations, but they rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model that explicitly learns the fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can then guide the feature correspondence learning process. More specifically, we first build semantic-embedded graphs to explore both the fine-grained objects and their relations in each media type, aiming not only to characterize the object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a newly designed cross-graph relation encoder explores the intermodal relations across modalities, mutually boosting the cross-modal correlations to learn more precise intermodal dependencies. In addition, a feature reconstruction module and multihead similarity alignment are leveraged to optimize the node-level semantic correspondence, yielding discriminative relation-aggregated cross-modal embeddings of images and text that benefit various image-text retrieval tasks. Extensive experiments on benchmark datasets verify, quantitatively and qualitatively, the advantages of the proposed framework for fine-grained image-text retrieval and show performance competitive with the state of the art.
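To make the idea of aggregating intermodal relations more concrete, the following is a minimal sketch in which each node of one modality's graph attends over the nodes of the other modality's graph. It is not the RACG cross-graph relation encoder: the abstract does not specify that design, so generic scaled dot-product attention is substituted here as an assumption. The function names (`cross_graph_aggregate`, `softmax`, `dot`) and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cross_graph_aggregate(query_nodes, context_nodes):
    """For each query node, attend over the nodes of the other modality's
    graph and return the attention-weighted sum of their features.

    Assumption: scaled dot-product attention stands in for the paper's
    (unspecified) cross-graph relation encoder.
    """
    dim = len(context_nodes[0])
    scale = math.sqrt(dim)
    aggregated = []
    for q in query_nodes:
        weights = softmax([dot(q, c) / scale for c in context_nodes])
        aggregated.append([
            sum(w * c[k] for w, c in zip(weights, context_nodes))
            for k in range(dim)
        ])
    return aggregated

# Hypothetical toy features: 2 image-region nodes and 3 word nodes, 2-D each.
image_nodes = [[1.0, 0.0], [0.0, 1.0]]
text_nodes = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Each modality gathers context from the other, in both directions.
image_with_text_context = cross_graph_aggregate(image_nodes, text_nodes)
text_with_image_context = cross_graph_aggregate(text_nodes, image_nodes)
```

In this sketch the aggregation runs in both directions, so image nodes receive textual context and word nodes receive visual context. A fuller model would presumably combine these aggregated features with the intramodal graph features before computing similarities, but that step is left out because the abstract gives no details.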

Visual Relations Augmented Cross-modal Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Deep Relation Embedding for Cross-Modal Retrieval

Semantic Modeling of Textual Relationships in Cross-modal Retrieval

Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image–Text Retrieval

Towards Bridged Vision and Language: Learning Cross-modal Knowledge Representation for Relation Extraction

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

Spatial-temporal Graphs for Cross-modal Text2Video Retrieval

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Video Relation Detection with Spatio-Temporal Graph

Cross-modal alignment with graph reasoning for image-text retrieval

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Video Visual Relation Detection Via Multi-modal Feature Fusion

Multi-view and region reasoning semantic enhancement for image-text retrieval

Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval