Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, Maarten de Rijke
2024-10-29
Abstract: We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity. We start by analyzing two common benchmarks, MS-COCO and Flickr30k, and compare them with augmented, fine-grained versions, MS-COCO-FG and Flickr30k-FG, given a specified set of linguistic features capturing concept granularity. Flickr30k-FG and MS-COCO-FG consistently give rise to higher scores across all the selected features. To further our understanding of the impact of granularity, we consider a novel taxonomy of query perturbations. We apply these perturbations to the selected datasets. We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions, with and without the applied perturbations. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. The relative performance drop across all setups is consistent across all models and datasets, indicating that the issue lies within the benchmarks themselves. We conclude by providing an agenda for improving ITR evaluation pipelines.
Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the brittleness of Image-Text Retrieval (ITR) evaluation benchmarks. Specifically:

1. **Concept Granularity**:
   - The paper first examines the concept granularity of ITR datasets. Concept granularity refers to the specificity of the relationship between images and their text descriptions. Existing benchmarks such as MS-COCO and Flickr30k usually use coarse-grained descriptions, which makes it difficult to evaluate a model's ability to recognize specific objects or attributes.
   - To address this, the researchers introduced fine-grained, augmented versions of the datasets, MS-COCO-FG and Flickr30k-FG. Their captions contain more contextual detail and therefore describe the images more specifically.
2. **Model Robustness**:
   - The paper also explores concept granularity from the perspective of model robustness. In practical applications, image-text retrieval faces noise and variation, such as semantic drift and spelling mistakes, which reduce model performance.
   - The researchers proposed a new evaluation framework that measures model robustness by applying input perturbations (for example, word-order sensitivity tests and noisy-input tests). The perturbations include word-order rearrangement, local word-order changes, interfering information, lexical variants, and spelling mistakes.
3. **Evaluation Metrics**:
   - The paper points out that existing ITR evaluation metrics usually rely on binary matching between images and text, ignoring the partial semantic overlap that may exist in the real world. The researchers therefore proposed a cross-modal evaluation metric that considers not only perfect matches but also the semantic similarity between the query and the candidates.
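
The perturbation types named above can be illustrated with a small sketch. This is not the paper's implementation: the function names, the distractor phrase, and the exact way each perturbation is applied are assumptions chosen for illustration, using only the Python standard library.

```python
import random


def shuffle_words(caption: str, rng: random.Random) -> str:
    """Word-order rearrangement: shuffle all tokens of the caption."""
    words = caption.split()
    rng.shuffle(words)
    return " ".join(words)


def swap_adjacent_words(caption: str, rng: random.Random) -> str:
    """Local word-order change: swap one random pair of neighbouring tokens."""
    words = caption.split()
    if len(words) < 2:
        return caption
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def add_typo(caption: str, rng: random.Random) -> str:
    """Spelling mistake: transpose two adjacent inner characters of one word.

    Words shorter than four characters are left alone; a word with a doubled
    letter at the chosen position comes out unchanged.
    """
    words = caption.split()
    candidates = [k for k, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return caption
    k = rng.choice(candidates)
    w = words[k]
    j = rng.randrange(1, len(w) - 2)
    words[k] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def add_distractor(caption: str, distractor: str = "next to a red car") -> str:
    """Interfering information: append a phrase unrelated to the image.

    The default phrase is a hypothetical example, not taken from the paper.
    """
    return f"{caption} {distractor}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    query = "a brown dog runs across the grass"
    for perturb in (shuffle_words, swap_adjacent_words, add_typo):
        print(perturb.__name__, "->", perturb(query, rng))
    print("add_distractor ->", add_distractor(query))
```

Each function takes a caption and returns a perturbed query, so the same retrieval run can be repeated on original and perturbed text and the scores compared.
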
In summary, the paper aims to improve the evaluation quality of ITR tasks and the robustness of models by analyzing how fine-grained datasets affect model performance and by introducing new evaluation frameworks and metrics.
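
To make the binary-matching evaluation and the relative performance drop mentioned in the abstract concrete, here is a minimal sketch. It assumes Recall@K as the retrieval metric, which is common in ITR but is not confirmed by this summary, and it defines the relative drop as (original - perturbed) / original. All function names and the toy data are hypothetical.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Binary-relevance Recall@K averaged over queries.

    A query counts as a hit if at least one of its relevant items appears
    among its top-k ranked candidates; partial semantic overlap is ignored.
    """
    if len(ranked_ids) != len(relevant_ids) or not ranked_ids:
        raise ValueError("need one non-empty ranking per query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_ids)


def relative_drop(original: float, perturbed: float) -> float:
    """Fraction of the original score lost after perturbing the queries."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (original - perturbed) / original


if __name__ == "__main__":
    # Toy rankings for two queries, before and after perturbation.
    relevant = [["img1"], ["img7"]]
    ranked_clean = [["img1", "img2", "img3"], ["img7", "img5", "img9"]]
    ranked_noisy = [["img2", "img1", "img3"], ["img5", "img9", "img4"]]

    clean = recall_at_k(ranked_clean, relevant, k=1)
    noisy = recall_at_k(ranked_noisy, relevant, k=1)
    print("R@1 clean:", clean, "R@1 perturbed:", noisy)
    print("relative drop:", relative_drop(clean, noisy))
```

Because the hit test is all-or-nothing, a candidate that partially matches the query earns no credit, which is the limitation the summary attributes to existing ITR metrics.
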