Abstract: Image-text retrieval (ITR) is one of the primary tasks in cross-modal retrieval and serves as a crucial bridge between computer vision and natural language processing. Significant progress has been made on global and local alignment between images and texts by mapping both modalities into a common space to establish correspondences between them. However, the rich semantic content of an image may produce false matches, in which the matched text ignores the image's main semantics and focuses instead on its secondary or other semantics. To address this issue, this paper proposes a semantically optimized approach with a novel Main Semantics Consistency (MSC) loss function, which aims to rank the images (or texts) most semantically similar to a given query at the top during retrieval. First, in each batch of image-text pairs, we separately compute (i) the image-image similarity, i.e., the similarity between every two images; (ii) the text-text similarity, i.e., the similarity between the group of texts belonging to one image and the group of texts belonging to another image; and (iii) the image-text similarity, i.e., the similarity between each image and each text. Afterward, the proposed MSC loss aligns the image-image, image-text, and text-text similarities, since the main semantics of two images should be close if their text descriptions are highly semantically consistent. In this way, we capture the main semantics of each image to be matched with its corresponding texts, prioritizing the most semantically related retrieval results. Extensive experiments on MSCOCO and Flickr30K verify the superior performance of MSC compared with state-of-the-art (SOTA) image-text retrieval methods. The source code of this project is released on GitHub: https://github.com/xyi007/MSC.
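The three batch-level similarities described in the abstract can be illustrated with a small, dependency-free sketch. This is not the released MSC implementation: the abstract does not give the loss formula, so the cosine similarity, the mean pooling over each image's caption group, and the squared-difference penalty below are all assumptions chosen only to make the idea concrete. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_image_sim(images):
    """(i) Similarity between every two images in the batch."""
    return [[cosine(a, b) for b in images] for a in images]


def text_text_sim(text_groups):
    """(ii) Similarity between the caption group of image i and that of image j.

    Assumption: a group-to-group score is the mean pairwise cosine similarity.
    """
    def group_sim(g, h):
        return sum(cosine(s, t) for s in g for t in h) / (len(g) * len(h))

    return [[group_sim(g, h) for h in text_groups] for g in text_groups]


def image_text_sim(images, text_groups):
    """(iii) Similarity between each image and each text.

    Assumption: scores are mean-pooled per caption group so that the result
    has the same batch-by-batch shape as the other two matrices.
    """
    return [
        [sum(cosine(img, t) for t in g) / len(g) for g in text_groups]
        for img in images
    ]


def mean_squared_gap(a, b):
    """Mean squared difference between two equally sized square matrices."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def consistency_penalty(images, text_groups):
    """Hypothetical stand-in for a consistency term (not the paper's formula).

    It pulls the image-image and image-text similarities toward the
    text-text similarities, which act as the reference for main semantics.
    """
    ii = image_image_sim(images)
    tt = text_text_sim(text_groups)
    it = image_text_sim(images, text_groups)
    return mean_squared_gap(ii, tt) + mean_squared_gap(it, tt)


# Toy batch: two images, each with two captions, as 2-D embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
text_groups = [
    [[1.0, 0.0], [1.0, 0.0]],  # captions of image 0
    [[0.0, 1.0], [0.0, 1.0]],  # captions of image 1
]
print(consistency_penalty(images, text_groups))
```

In this toy batch the captions agree exactly with their images, so the three matrices coincide and the penalty vanishes. If both caption groups described the same content while the images differed, the penalty would grow, which is the kind of inconsistency the abstract says MSC is meant to reduce.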

Mutil-level Local Alignment and Semantic Matching Network for Image-Text Retrieval

Local Alignment with Global Semantic Consistence Network for Image–Text Matching

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Dual Semantic Relationship Attention Network for Image-Text Matching

Multi-level similarity learning for image-text retrieval

Cross-modal Semantically Augmented Network for Image-text Matching

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

Multilateral Semantic Relations Modeling for Image Text Retrieval

Image-text Retrieval with Main Semantics Consistency

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Step-Wise Hierarchical Alignment Network for Image-Text Matching

Multi-scale Motivated Neural Network for Image-Text Matching

Image-Text Retrieval with Cross-Modal Semantic Importance Consistency

Bottom-Up Progressive Semantic Alignment for Image-Text Retrieval

Attention-Based Multi-level Network for Text Matching with Feature Fusion

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Bi-directional Spatial-Semantic Attention Networks for Image-Text Matching