Abstract: Image-text matching is a major task in cross-modal information processing. The main challenge is to learn unified visual and textual representations. Previous methods that perform well on this task focus primarily on the alignment between region features in images and the corresponding words in sentences, as well as the alignment between relations among regions and relational words. However, without joint learning of regional and global features, the regional features lose contact with the global context, which leads to mismatches with non-object words that carry global meaning in some sentences. To alleviate this issue, the relations between regions and the relations between regional and global concepts need to be enhanced, so that the visual representation is more accurate and better correlated with the corresponding text. Thus, a novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, which mainly consists of two modules: the separate semantic relations module and the joint semantic relations module. DSRAN performs graph attention in both modules, for region-level relations enhancement and regional-global relations enhancement respectively. With these two modules, different hierarchies of semantic relations are learned simultaneously, which promotes image-text matching by providing more information for the final visual representation. Quantitative experiments have been performed on MS-COCO and Flickr30K, and our method outperforms previous approaches by a large margin owing to the effectiveness of the dual semantic relations learning scheme.
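The two-stage idea in the abstract (graph attention among regions, then graph attention over regional and global features together) can be illustrated with a minimal NumPy sketch. This is not the DSRAN implementation: the single-head dot-product attention, the fully connected graph, the residual connection, the mean pooling, and every name here (`graph_attention`, `dual_relations`, `make_params`) are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, w_q, w_k, w_v):
    """One single-head attention pass over a fully connected graph.

    nodes: (n, d) node features; w_q, w_k, w_v: (d, d) projections.
    Returns the enhanced nodes (with a residual connection) and the
    (n, n) attention weights. Assumed form, not the paper's exact layer.
    """
    q, k, v = nodes @ w_q, nodes @ w_k, nodes @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return nodes + weights @ v, weights

def make_params(d, rng):
    # Hypothetical helper: random projections for the two modules.
    def proj():
        return rng.normal(size=(d, d)) / np.sqrt(d)
    return {
        "separate": (proj(), proj(), proj()),
        "joint": (proj(), proj(), proj()),
    }

def dual_relations(regions, global_feats, params):
    """Region-level enhancement followed by regional-global enhancement.

    regions: (r, d) region features; global_feats: (g, d) global features.
    Returns a unit-norm (d,) image embedding.
    """
    # Stage 1 (assumed "separate" module): relations among regions only.
    regions_enh, _ = graph_attention(regions, *params["separate"])
    # Stage 2 (assumed "joint" module): regions and global features
    # attend to each other in one graph.
    joint_nodes = np.concatenate([regions_enh, global_feats], axis=0)
    joint_enh, _ = graph_attention(joint_nodes, *params["joint"])
    # Mean pooling into one vector is an illustrative choice.
    image_vec = joint_enh.mean(axis=0)
    return image_vec / np.linalg.norm(image_vec)
```

The resulting vector could then be compared with a normalized text embedding by dot product to obtain a matching score; how the text side is encoded is outside what the abstract specifies.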

Scene graph semantic inference for image and text matching

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Visual-Semantic Graph Matching for Visual Grounding

Scene Graph Based Fusion Network For Image-Text Retrieval

Visual Semantic Reasoning for Image-Text Matching

Graph Structured Network for Image-Text Matching

Visual-Semantic Matching by Exploring High-Order Attention and Distraction

HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Cross-modal Graph Matching Network for Image-text Retrieval

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Complex Scene Image Editing by Scene Graph Comprehension

Scene Graph Generation: A Comprehensive Survey

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Dual Semantic Relationship Attention Network for Image-Text Matching

SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations