Abstract: Image-text retrieval, a fundamental branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching, and some recent works focus on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adding more similar images. On the text side, we propose a novel semi-automatic renovation approach that refines coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing the attributes of close objects in images. Our code and improved benchmark datasets are publicly available at https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Cross-Modal Image-Text Retrieval with Semantic Consistency

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Cross-modal Image-Text Retrieval with Multitask Learning

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

EduCross: Dual adversarial bipartite hypergraph learning for cross-modal retrieval in multimodal educational slides

Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment

Visual context learning based on textual knowledge for image-text retrieval

Improving Multi-Modal Learning with Uni-Modal Teachers

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Cross-modal Graph Matching Network for Image-text Retrieval

Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval

Feature Fusion Based on Transformer for Cross-modal Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Revising Image-Text Retrieval via Multi-Modal Entailment
