FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis

Mikel Williams-Lekuona, Georgina Cosma

2024-07-29

Abstract: In the field of Image-Text Retrieval (ITR), recent advancements have leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG) instance-level retrieval, achieving high accuracy at the cost of increased computational complexity. For Coarse-Grained (CG) category-level retrieval, prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency, albeit at the cost of retrieval performance. Due to differences in methodologies, FG and CG models are rarely compared directly within evaluations in the literature, resulting in a lack of empirical data quantifying the retrieval performance-efficiency tradeoffs between the two. This paper addresses this gap by introducing the FiCo-ITR library, which standardises evaluation methodologies for both FG and CG models, facilitating direct comparisons. We conduct empirical evaluations of representative models from both subfields, analysing precision, recall, and computational complexity across varying data scales. Our findings offer new insights into the performance-efficiency trade-offs between recent representative FG and CG models, highlighting their respective strengths and limitations. These findings provide the foundation necessary to make more informed decisions regarding model selection for specific retrieval tasks and highlight avenues for future research into hybrid systems that leverage the strengths of both FG and CG approaches.

Information Retrieval, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of comparing two subfields of Image-Text Retrieval (ITR): Fine-Grained (FG) ITR and Coarse-Grained (CG) ITR. Specifically, it targets three problems:

1. **Direct comparison difficulties due to methodological differences**: FG and CG methods differ technically, so obtaining directly comparable quantitative results is challenging.
2. **Lack of comprehensive benchmarking**: Comparative reviews of FG and CG methods exist in the literature, but empirical comparative evaluations of recent representative models are scarce.
3. **Small-scale datasets limit scalability assessment**: Traditional ITR benchmark datasets such as Flickr30K and MS-COCO are small compared with the large-scale datasets used in real-world applications, which may bias the understanding of the trade-off between a model's retrieval performance and its efficiency.

To address these problems, the paper proposes the FiCo-ITR library and toolkit, which standardizes the evaluation of FG and CG models so that they can be compared directly. The paper also empirically evaluates the precision, recall, and computational complexity of representative FG and CG models and examines their performance across different data scales. The ultimate goal is to provide a basis for selecting appropriate models for specific retrieval tasks and to guide future research on integrating the strengths of FG and CG methods.
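To make the FG/CG distinction concrete, the sketch below contrasts the two evaluation styles on toy data. It is illustrative only and is not the FiCo-ITR API: the function names, the toy data, and the choice of Recall@K (instance-level) and mean average precision over a Hamming ranking (category-level) are assumptions based on common practice in the two subfields, not details taken from the paper.

```python
# Illustrative sketch only: hypothetical helpers, not the FiCo-ITR API.

def recall_at_k(sim, k):
    """Instance-level (FG-style) metric: fraction of queries whose matching
    item, assumed to share the query's index, appears in the top-k results."""
    hits = 0
    for q, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sim)


def hamming(a, b):
    """Number of differing bits between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))


def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    """Category-level (CG-style) metric: rank database items by Hamming
    distance; an item is relevant when it shares the query's label."""
    ap_values = []
    for code, label in zip(query_codes, query_labels):
        ranked = sorted(range(len(db_codes)),
                        key=lambda j: hamming(code, db_codes[j]))
        hits, precisions = 0, []
        for rank, j in enumerate(ranked, start=1):
            if db_labels[j] == label:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(ap_values) / len(ap_values)


# Toy FG data: similarity of 3 text queries (rows) to 3 images (columns).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)

# Toy CG data: 4-bit hash codes with category labels.
db_codes = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
db_labels = ["cat", "dog", "cat", "dog"]
m = mean_average_precision([[0, 0, 0, 0]], ["cat"], db_codes, db_labels)

print(f"R@1={r1:.3f} R@2={r2:.3f} mAP={m:.3f}")
```

The FG metric rewards retrieving the one exact match, while the CG metric rewards ranking any same-category item highly. Because the two measure different things, their scores are not directly comparable without a shared evaluation protocol of the kind the paper sets out to provide.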
