Abstract: Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

What problem does this paper attempt to address?

This paper addresses the limited ability of existing vision-language models to understand and localize objects in images. Although these models learn general, semantically rich visual representations and can recognize image content in a zero-shot setting, they perform poorly at determining where objects are located in an image and at grouping visually related parts together. The authors point out that existing vision and language representation learning models based on contrastive losses and large-scale web data capture limited object localization information.

To address this, the paper proposes a minimal set of modifications that enable the model to learn semantic and spatial information simultaneously. The resulting model performs well on zero-shot image recognition and also achieves state-of-the-art results on unsupervised bottom-up and top-down semantic segmentation. The paper further demonstrates that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

Specifically, the main contributions of the paper include:

1. **Identifying Systematic Failures**: Revealing the systematic failures of contrastive vision-language models in correctly identifying the positions of objects in images and grouping semantically related content.
2. **Designing Minimal Changes**: Proposing a minimal set of changes that give these models the ability to perform perceptual grouping, achieving state-of-the-art zero-shot segmentation performance without using any segmentation data or performing task-specific fine-tuning.
3. **Enhancing Robustness**: Demonstrating that the model is uniquely robust to counterfactual manipulations, with a degree of robustness equal to or even exceeding that of previous supervised learning methods that use specialized training methods.

Through these improvements, the paper aims to enhance the ability of vision-language models to understand image content and its spatial layout, making them more general and robust.
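As a rough illustration of what top-down zero-shot segmentation means here, the sketch below labels each image patch with the text label whose embedding is most similar to it under cosine similarity. This is a generic toy example, not the paper's method: the text above does not describe the proposed modifications, and the function names (`zero_shot_segment`, `cosine`), the 2-D embeddings, and the labels are all hypothetical stand-ins for the outputs of a real image and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_segment(patch_embeddings, text_embeddings):
    """Assign each patch the label of its most similar text embedding.

    patch_embeddings: grid (list of rows) of per-patch vectors (assumed input).
    text_embeddings:  dict mapping a label string to its vector (assumed input).
    Returns a grid of label strings with the same layout as the patches.
    """
    labels = list(text_embeddings)
    segmentation = []
    for row in patch_embeddings:
        seg_row = []
        for patch in row:
            scores = [cosine(patch, text_embeddings[label]) for label in labels]
            seg_row.append(labels[scores.index(max(scores))])
        segmentation.append(seg_row)
    return segmentation


# Toy 2x2 grid of patches with made-up 2-D embeddings.
text = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}
patches = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.0, 1.0]],
]
segmentation = zero_shot_segment(patches, text)
print(segmentation)
```

In this picture, the localization problem the paper identifies corresponds to patch embeddings that do not line up spatially with the content they cover, so a per-patch argmax of this kind would produce poor masks even when image-level recognition is accurate.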

Perceptual Grouping in Contrastive Vision-Language Models

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

Superpixel Semantics Representation and Pre-training for Vision-Language Task

LLM meets Vision-Language Models for Zero-Shot One-Class Classification

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

Locality Alignment Improves Vision-Language Models

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

Teaching Structured Vision&Language Concepts to Vision&Language Models

GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

What Remains of Visual Semantic Embeddings

Semantics-Guided Contrastive Network for Zero-Shot Object detection

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Neural Implicit Vision-Language Feature Fields