Abstract: We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity. We start by analyzing two common benchmarks, MS-COCO and Flickr30k, and compare them with augmented, fine-grained versions, MS-COCO-FG and Flickr30k-FG, given a specified set of linguistic features capturing concept granularity. Flickr30k-FG and MS-COCO-FG consistently give rise to higher scores across all the selected features. To further our understanding of the impact of granularity, we consider a novel taxonomy of query perturbations. We apply these perturbations to the selected datasets. We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions, with and without the applied perturbations. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. The relative performance drop across all setups is consistent across all models and datasets, indicating that the issue lies within the benchmarks themselves. We conclude by providing an agenda for improving ITR evaluation pipelines.

What problem does this paper attempt to address?

This paper addresses the brittleness of image-text retrieval (ITR) evaluation benchmarks. Specifically:

1. **Concept Granularity**:
   - The paper first examines the concept granularity of ITR datasets. Concept granularity refers to how specific the relationship between an image and its text description is. Existing benchmarks such as MS-COCO and Flickr30k mostly use coarse-grained descriptions, which makes it difficult to evaluate a model's ability to recognize specific objects or attributes.
   - For this reason, the researchers introduce fine-grained, augmented versions of the datasets, MS-COCO-FG and Flickr30k-FG. These datasets contain more contextual detail and therefore provide more specific descriptions.
2. **Model Robustness**:
   - The paper also approaches concept granularity from the perspective of model robustness. In practical applications, image-text retrieval faces noise and variation, such as semantic drift and spelling mistakes, which reduce model performance.
   - The researchers propose a new evaluation framework that measures robustness by perturbing the input queries (for example, word-order sensitivity tests and noisy-input tests). The perturbations include word-order rearrangement, local word-order changes, distracting information, lexical variants, and spelling mistakes.
3. **Evaluation Metrics**:
   - The paper points out that existing ITR evaluation metrics usually rely on binary matching between images and text, ignoring the partial semantic overlap that may exist in real-world data. The researchers therefore propose a cross-modal evaluation metric that considers not only exact matches but also the semantic similarity between the query and the candidates.
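The perturbation types listed under Model Robustness can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the character-swap typo model, and the default distractor phrase are all assumptions made for illustration, and the paper's taxonomy may define each perturbation differently.

```python
import random


def shuffle_words(caption, rng):
    """Word-order rearrangement: shuffle every word in the caption."""
    words = caption.split()
    rng.shuffle(words)
    return " ".join(words)


def swap_adjacent_words(caption, rng):
    """Local word-order change: swap one randomly chosen adjacent pair."""
    words = caption.split()
    if len(words) < 2:
        return caption
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def add_typo(caption, rng):
    """Spelling mistake (assumed model): swap two adjacent inner characters
    of one randomly chosen word that has at least four characters."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return caption
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep first and last characters fixed
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def append_distractor(caption, distractor="and the sky is blue"):
    """Distracting information: append an unrelated clause (hypothetical default)."""
    return f"{caption.rstrip('.')} {distractor}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so perturbed queries are reproducible
    query = "a brown dog catches a red frisbee in the park"
    for perturb in (shuffle_words, swap_adjacent_words, add_typo):
        print(perturb.__name__, "->", perturb(query, rng))
    print("append_distractor ->", append_distractor(query))
```

Passing an explicit `random.Random` instance keeps the perturbed query set identical across models, which matters when comparing performance drops between them.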
In conclusion, this paper aims to improve the quality of ITR evaluation and the assessment of model robustness by analyzing how fine-grained datasets affect model performance and by introducing new evaluation frameworks and metrics.
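The comparison of model performance with and without perturbations can be sketched as follows. This is a hedged illustration only: Recall@K is a standard ITR metric, but this page does not state which metric or which formula for relative drop the paper uses, so `recall_at_k`, `relative_drop`, the toy similarity scores, and the single-ground-truth assumption are all introduced here for illustration.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity[q][c] is the model's score for query q against candidate c.
    ground_truth[q] is the index of the single correct candidate for query q
    (an assumption; real benchmarks may have several correct candidates).
    """
    hits = 0
    for q, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_drop(clean_score, perturbed_score):
    """Relative performance drop, as a fraction of the clean score."""
    return (clean_score - perturbed_score) / clean_score


if __name__ == "__main__":
    ground_truth = [0, 1, 2]
    # Toy scores for three queries against three candidates.
    clean = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.3], [0.1, 0.3, 0.7]]
    # After perturbing the queries, the second query now prefers candidate 2.
    perturbed = [[0.9, 0.1, 0.2], [0.2, 0.4, 0.5], [0.1, 0.3, 0.7]]

    r_clean = recall_at_k(clean, ground_truth, k=1)
    r_pert = recall_at_k(perturbed, ground_truth, k=1)
    print(f"R@1 clean={r_clean:.3f} perturbed={r_pert:.3f} "
          f"relative drop={relative_drop(r_clean, r_pert):.3f}")
```

Reporting the drop relative to the clean score, rather than the absolute difference, makes models with different baseline accuracy comparable.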

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Rethinking Benchmarks for Cross-modal Image-text Retrieval

FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis

RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models

Holistic Evaluation of Text-To-Image Models

Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Words Aren't Enough, Their Order Matters: on the Robustness of Grounding Visual Referring Expressions.

GRIT: General Robust Image Task Benchmark

@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms