Abstract: Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this by ensuring that the vision backbone effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability -- pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed that uses a masked reconstruction loss to learn semantic contributions for each image patch. We first evaluate locality alignment with a vision-only benchmark, finding that it improves a model's performance at a patch-level semantic segmentation task, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality-aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure complements existing VLM training recipes that use off-the-shelf vision backbones.

What problem does this paper attempt to address?

The problem this paper addresses is the weak basic spatial understanding of current vision language models (VLMs). Many existing VLMs make errors when handling spatial relationships, and perform poorly on tasks such as object localization, counting, and relational question answering. The authors hypothesize that this is mainly because these models adopt pre-trained vision backbones, specifically Vision Transformers (ViTs), which are trained primarily with image-level supervision and may therefore fail to encode local semantics. To address this, the authors propose a new post-training phase called locality alignment and a new fine-tuning method called MaskEmbed. With these methods, they aim to make the vision backbone capture both local and global image semantics, thereby improving VLM performance on various benchmarks, especially on tasks requiring spatial understanding.

### Main Contributions:

1. **Introduction of locality alignment**: This is a post-training phase for ViTs that uses self-supervision to bring out local semantics in a backbone whose original training emphasized global, image-level information.
2. **Proposal of the MaskEmbed method**: This is a fine-tuning method that uses masked embedding self-consistency to improve the model's local feature extraction without requiring additional annotated data.
3. **Experimental validation**: Through a series of experiments, the authors demonstrate the effectiveness of locality alignment and MaskEmbed. These experiments cover vision-only tasks (patch-level semantic segmentation) and vision-language benchmarks (RefCOCO, OCID-Ref, TallyQA, VSR, AI2D, etc.), and show that locality alignment improves the model's performance on these tasks.

### Key Points of the Solution:

- **Locality alignment**: Through self-supervised learning, it enables pre-trained ViTs to better capture the local semantics of images.
- **MaskEmbed**: By using masked inputs and a reconstruction loss, it learns the semantic contribution of each image patch, thereby strengthening the model's local feature extraction.

### Experimental Results:

- **Visual tasks**: In the semantic segmentation task, locality alignment improved the performance of multiple pre-trained models, especially large and high-resolution models (such as CLIP ViT-L @ 336px and SigLIP SO400M @ 384px).
- **Visual language tasks**: In multiple benchmarks involving spatial understanding, VLMs with locality alignment showed better performance, particularly in tasks such as object localization, text understanding, counting, and relational question answering.

Overall, this paper targets the shortcomings of current VLMs in spatial understanding by introducing locality alignment and the MaskEmbed method, providing new ideas and tools for future multimodal model research.
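The summary describes MaskEmbed only at a high level (masked inputs plus a reconstruction loss), so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a frozen teacher that sees the masked image, a student encoder that sees the full image and has its patch embeddings masked afterwards, and a small decoder that tries to match the teacher's output. All names (`mask_patches`, `mean_pool`, `maskembed_loss`, `local_encoder`, `global_encoder`) and the toy mean-pooling teacher are hypothetical and chosen only for illustration.

```python
# Toy illustration of a masked reconstruction loss in the spirit of MaskEmbed.
# Assumption: patches and embeddings are plain lists of float vectors.

def mask_patches(vectors, mask, fill=0.0):
    """Replace every vector whose mask entry is 0 with a constant fill vector."""
    return [[v if keep else fill for v in vec] for vec, keep in zip(vectors, mask)]

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

def maskembed_loss(teacher, encoder, decoder, patches, mask):
    """Mean squared error between decoder(masked embeddings) and teacher(masked image)."""
    target = teacher(mask_patches(patches, mask))   # frozen teacher sees the masked image
    embeddings = encoder(patches)                   # student sees the full image
    prediction = decoder(mask_patches(embeddings, mask))  # masking at the embedding level
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

def local_encoder(patches):
    """Each embedding depends only on its own patch."""
    return [list(p) for p in patches]

def global_encoder(patches):
    """Every embedding is the same global average, so position information is lost."""
    pooled = mean_pool(patches)
    return [list(pooled) for _ in patches]

patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 1.0]]
mask = [1, 0, 1, 0]  # keep patches 0 and 2, mask out patches 1 and 3

loss_local = maskembed_loss(mean_pool, local_encoder, mean_pool, patches, mask)
loss_global = maskembed_loss(mean_pool, global_encoder, mean_pool, patches, mask)
print(loss_local, loss_global)
```

In this toy setup the encoder whose embeddings stay tied to their own patch reconstructs the teacher's masked output exactly, while the encoder that smears global content across all positions incurs a positive loss. That is the intuition behind using such a loss to push per-patch embeddings toward local semantics; a real setup would use ViT backbones, a learned decoder, and many sampled masks.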

Locality Alignment Improves Vision-Language Models

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Unified Lexical Representation for Interpretable Visual-Language Alignment

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Contrastive Vision-Language Alignment Makes Efficient Instruction Learner

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Visually-Augmented Language Modeling

Pixel Aligned Language Models

Optimization Efficient Open-World Visual Region Recognition

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers