FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis

Mikel Williams-Lekuona, Georgina Cosma
2024-07-29
Abstract: In the field of Image-Text Retrieval (ITR), recent advancements have leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG) instance-level retrieval, achieving high accuracy at the cost of increased computational complexity. For Coarse-Grained (CG) category-level retrieval, prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency, albeit at the cost of retrieval performance. Due to differences in methodologies, FG and CG models are rarely compared directly within evaluations in the literature, resulting in a lack of empirical data quantifying the retrieval performance-efficiency trade-offs between the two. This paper addresses this gap by introducing the FiCo-ITR library, which standardises evaluation methodologies for both FG and CG models, facilitating direct comparisons. We conduct empirical evaluations of representative models from both subfields, analysing precision, recall, and computational complexity across varying data scales. Our findings offer new insights into the performance-efficiency trade-offs between recent representative FG and CG models, highlighting their respective strengths and limitations. These findings provide the foundation necessary to make more informed decisions regarding model selection for specific retrieval tasks and highlight avenues for future research into hybrid systems that leverage the strengths of both FG and CG approaches.
Information Retrieval, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of comparing two subfields of Image-Text Retrieval (ITR): Fine-Grained (FG) ITR and Coarse-Grained (CG) ITR. It identifies three main problems:

1. **Direct comparison difficulties due to methodological differences**: FG and CG methods differ technically, so obtaining directly comparable quantitative results is challenging.
2. **Lack of comprehensive benchmarking**: Although the literature contains comparative reviews of FG and CG methods, empirical comparative evaluations of recent representative models are scarce.
3. **Small-scale datasets limit scalability assessment**: Traditional ITR benchmark datasets such as Flickr30K and MS-COCO are small compared with the large-scale datasets used in real-world applications, which may bias the understanding of the trade-off between retrieval performance and efficiency.

To address these issues, the paper proposes the FiCo-ITR library and toolkit, which standardizes the evaluation methods of FG and CG models so that they can be compared directly. The paper also empirically evaluates the precision, recall, and computational complexity of representative FG and CG models and examines how they perform across different data scales. The ultimate goal is to provide a basis for selecting appropriate models for specific retrieval tasks and to guide future research on integrating the strengths of FG and CG methods.
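To illustrate why instance-level and category-level results are hard to compare directly, the following minimal sketch scores one ranking with two metrics. It is not the FiCo-ITR API. The function names, the toy similarity matrix, and the choice of Recall@K and mean average precision as representative metrics are all assumptions made for illustration.

```python
from typing import List, Sequence, Set


def rank_gallery(scores: Sequence[float]) -> List[int]:
    """Return gallery indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def recall_at_k(sim: Sequence[Sequence[float]],
                matches: Sequence[Set[int]], k: int) -> float:
    """Instance-level metric: fraction of queries whose exact
    ground-truth item appears among the top-k retrieved items."""
    hits = 0
    for scores, match in zip(sim, matches):
        top_k = rank_gallery(scores)[:k]
        if any(j in match for j in top_k):
            hits += 1
    return hits / len(sim)


def mean_average_precision(sim: Sequence[Sequence[float]],
                           query_labels: Sequence[str],
                           gallery_labels: Sequence[str]) -> float:
    """Category-level metric: a gallery item counts as relevant
    whenever it shares the query's category label."""
    ap_values = []
    for scores, q_label in zip(sim, query_labels):
        num_relevant = 0
        precision_sum = 0.0
        for rank, j in enumerate(rank_gallery(scores), start=1):
            if gallery_labels[j] == q_label:
                num_relevant += 1
                precision_sum += num_relevant / rank
        if num_relevant:
            ap_values.append(precision_sum / num_relevant)
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# Hypothetical toy data: 2 queries scored against 3 gallery items.
sim = [[0.9, 0.2, 0.4],
       [0.1, 0.3, 0.8]]
matches = [{0}, {1}]                    # exact instance-level ground truth
query_labels = ["dog", "cat"]           # category-level ground truth
gallery_labels = ["dog", "cat", "cat"]

print(recall_at_k(sim, matches, k=1))                             # 0.5
print(mean_average_precision(sim, query_labels, gallery_labels))  # 1.0
```

On this toy data the same ranking scores 1.0 under the category-level metric but only 0.5 under instance-level Recall@1, because the second query retrieves an item of the right category that is not its exact match. Differences of this kind are what a standardised evaluation protocol has to account for.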