Abstract: Image-text retrieval, as a fundamental and important branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on widely used image-text retrieval benchmarks, i.e., MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching. The reason is that a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adopting more similar images. On the text side, we propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.

Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression

Semantic Completion and Filtration for Image–Text Retrieval

Research on high-level semantic image retrieval

Cross-Modal Image-Text Retrieval with Semantic Consistency

Image-text Retrieval via Preserving Main Semantics of Vision

Image-Text Retrieval with Cross-Modal Semantic Importance Consistency

Image-text Retrieval with Main Semantics Consistency

Commonsense-Guided Semantic and Relational Consistencies for Image-Text Retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning

Visual Semantic Reasoning for Image-Text Matching

Cross-modal Semantic Interference Suppression for image-text matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Semantic Communication Approach for Multi-Task Image Transmission

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Multilateral Semantic Relations Modeling for Image Text Retrieval

Perceptual Image Compression with Cooperative Cross-Modal Side Information