Abstract: Vision and Language Models (VLMs) continue to demonstrate remarkable zero-shot (ZS) performance across various tasks. However, many probing studies have revealed that even the best-performing VLMs struggle to capture aspects of compositional scene understanding, lacking the ability to properly ground and localize linguistic phrases in images. Recent VLM advancements include scaling up both model and dataset sizes, additional training objectives and levels of supervision, and variations in the model architectures. To characterize the grounding ability of VLMs, such as phrase grounding, referring expressions comprehension, and relationship understanding, Pointing Game has been used as an evaluation metric for datasets with bounding box annotations. In this paper, we introduce a novel suite of quantitative metrics that utilize GradCAM activations to rigorously evaluate the grounding capabilities of pre-trained VLMs like CLIP, BLIP, and ALBEF. These metrics offer an explainable and quantifiable approach for a more detailed comparison of the zero-shot capabilities of VLMs and enable measuring models' grounding uncertainty. This characterization reveals interesting tradeoffs between the size of the model, the dataset size, and their performance.

What problem does this paper attempt to address?

This paper addresses the difficulty that Vision and Language Models (VLMs) have with compositional scene understanding, in particular their weakness at grounding and localizing linguistic phrases in specific image regions. Although existing VLMs show strong zero-shot performance across many tasks, they still struggle to capture compositional aspects of a scene and to ground phrases in the correct objects. To evaluate and quantify this, the authors introduce a new suite of quantitative metrics that use GradCAM activation maps to rigorously assess the grounding capabilities of pre-trained VLMs such as CLIP, BLIP, and ALBEF. The metrics provide an explainable, quantifiable evaluation, allow a more detailed comparison of the zero-shot capabilities of different VLMs, and make it possible to measure a model's grounding uncertainty. The main issues include:

1. **Limitations of existing evaluation methods**: The traditional Pointing Game provides only a coarse 0/1 score, is susceptible to local maxima, and does not reflect how confidently the model grounds a concept.
2. **Fine-grained evaluation of model grounding capabilities**: More detailed methods are needed to evaluate grounding performance across datasets, especially when there are multiple high-confidence activation points and some of them fall outside the ground-truth bounding box.
3. **Impact of model size and dataset size**: The paper explores how model size and dataset size affect grounding performance, revealing trade-offs between model scale and data scale.

By introducing these metrics, the paper aims to provide a more comprehensive and detailed evaluation method that helps researchers better understand and improve the grounding capabilities of VLMs.
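To make the Pointing Game's 0/1 limitation concrete, here is a minimal Python sketch. It assumes the activation map is a plain 2D grid of non-negative scores and that the bounding box uses inclusive pixel coordinates. `pointing_game_hit` follows the usual definition of the Pointing Game (a hit when the single highest activation lies inside the box). `inside_box_fraction` is a hypothetical continuous score added purely for contrast; it is not one of the metrics proposed in the paper, and both function names are illustrative.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def pointing_game_hit(heatmap: List[List[float]], box: Box) -> bool:
    """Return True if the single highest activation lies inside the box."""
    best_val = float("-inf")
    best_xy = (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_val, best_xy = value, (x, y)
    x, y = best_xy
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def inside_box_fraction(heatmap: List[List[float]], box: Box) -> float:
    """Hypothetical continuous score: share of total activation inside the box."""
    x_min, y_min, x_max, y_max = box
    total = 0.0
    inside = 0.0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            mass = max(value, 0.0)  # ignore negative activations
            total += mass
            if x_min <= x <= x_max and y_min <= y <= y_max:
                inside += mass
    return inside / total if total > 0 else 0.0


# Toy map: most activation sits inside the box, but one stray peak lies outside.
heatmap = [
    [0.8, 0.8, 0.0, 0.0],
    [0.8, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]
box = (0, 0, 1, 1)

hit = pointing_game_hit(heatmap, box)
fraction = inside_box_fraction(heatmap, box)
print(hit, round(fraction, 3))
```

In this toy case the Pointing Game reports a miss because of the single outlying peak, even though most of the activation mass lies inside the box. This is the kind of situation where a graded measure carries more information than a binary one.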

Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Learning Visual Grounding from Generative Vision and Language Model

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Learning to Ground VLMs without Forgetting

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Learning Comprehensive Visual Grounding for Video Captioning

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Grounded 3D-LLM with Referent Tokens

LLM4VG: Large Language Models Evaluation for Video Grounding

GLaMM: Pixel Grounding Large Multimodal Model

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos