Abstract: Image-text retrieval (ITR) is one of the primary tasks in cross-modal retrieval and serves as a crucial bridge between computer vision and natural language processing. Significant progress has been made in achieving global and local alignment between images and texts by mapping both modalities into a common space and establishing correspondences between them. However, the rich semantic content of an image can cause false matches, in which the matched text ignores the image's main semantics and focuses instead on its secondary semantics. To address this issue, this paper proposes a semantically optimized approach with a novel Main Semantics Consistency (MSC) loss function, which aims to rank the images (or texts) that are semantically most similar to a given query at the top of the retrieval results. First, in each batch of image-text pairs, we separately compute (i) the image-image similarity, i.e., the similarity between every two images, (ii) the text-text similarity, i.e., the similarity between the group of texts that belongs to one image and the group of texts that belongs to another image, and (iii) the image-text similarity, i.e., the similarity between each image and each text. Afterward, the proposed MSC loss aligns the image-image, image-text, and text-text similarities, on the premise that the main semantics of two images are close if their text descriptions are highly semantically consistent. In this way, we capture the main semantics of each image to be matched with its corresponding texts and prioritize the semantically most related retrieval results. Extensive experiments on MSCOCO and FLICKR30K verify the superior performance of MSC compared with state-of-the-art (SOTA) image-text retrieval methods. The source code of this project is released on GitHub: https://github.com/xyi007/MSC.
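
The three batch-level similarities described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the function names `batch_similarities` and `consistency_penalty`, the use of cosine similarity, the averaging over each image's caption group, and the squared-difference penalty are all hypothetical choices made for this example. The abstract does not give the exact form of the MSC loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def batch_similarities(img_emb, txt_emb):
    """Compute the three similarities for one batch.

    img_emb: (B, D) array, one embedding per image.
    txt_emb: (B, K, D) array, K caption embeddings per image.
    Returns (sim_ii, sim_tt, sim_it) with shapes (B, B), (B, B), (B, B*K).
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    B, K, D = txt.shape
    flat = txt.reshape(B * K, D)

    # (i) image-image: cosine similarity between every two images.
    sim_ii = img @ img.T

    # (ii) text-text: similarity between caption groups, taken here
    # (assumption) as the mean pairwise cosine similarity of their captions.
    pair = flat @ flat.T
    sim_tt = pair.reshape(B, K, B, K).mean(axis=(1, 3))

    # (iii) image-text: cosine similarity between each image and each text.
    sim_it = img @ flat.T
    return sim_ii, sim_tt, sim_it


def consistency_penalty(sim_ii, sim_tt, sim_it):
    """Illustrative stand-in for an alignment term (not the paper's loss).

    Penalizes disagreement between the image-image and text-text
    similarities, and between the group-averaged image-text similarity
    and the text-text similarity.
    """
    B = sim_ii.shape[0]
    K = sim_it.shape[1] // B
    sim_it_group = sim_it.reshape(B, B, K).mean(axis=2)
    return float(
        np.mean((sim_ii - sim_tt) ** 2)
        + np.mean((sim_it_group - sim_tt) ** 2)
    )
```

In this sketch, the penalty is zero when the three similarity structures agree exactly and grows as they diverge, which mirrors the stated intuition that two images should be close when their descriptions are semantically consistent.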

Commonsense-Guided Semantic and Relational Consistencies for Image-Text Retrieval

Cross-Modal Image-Text Retrieval with Semantic Consistency

Image-Text Retrieval with Cross-Modal Semantic Importance Consistency

Multilateral Semantic Relations Modeling for Image Text Retrieval

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression

Knowledge Aware Semantic Concept Expansion for Image-Text Matching

Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Semantic Completion and Filtration for Image–Text Retrieval

Image-text Retrieval via Preserving Main Semantics of Vision

Image-text Retrieval with Main Semantics Consistency

Multi-level similarity learning for image-text retrieval

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Bi-Directional Image-Text Retrieval with Position Attention and Similarity Filtering

External Knowledge Dynamic Modeling for Image-text Retrieval

Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval