Abstract: Image-text retrieval, a fundamental branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on widely used image-text retrieval benchmarks, i.e., MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adopting more similar images. On the text side, we propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.

A Framework for Image Text Retrieval Based on Large Language Model

Semantic Image Retrieval Based on Multiple-Instance Learning

Modeling Image Data for Effective Indexing and Retrieval in Large General Image Databases

Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Semantic Completion and Filtration for Image–Text Retrieval

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

A Study of Language Model for Image Retrieval

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Image-text matching using multi-subspace joint representation

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

Image-text Retrieval via Preserving Main Semantics of Vision

A Probabilistic Semantic Model for Image Annotation and Multi-Modal Image Retrieval