Locality Alignment Improves Vision-Language Models

Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto
2024-10-15
Abstract: Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this by ensuring that the vision backbone effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability -- pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed that uses a masked reconstruction loss to learn semantic contributions for each image patch. We first evaluate locality alignment with a vision-only benchmark, finding that it improves a model's performance at a patch-level semantic segmentation task, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality-aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure complements existing VLM training recipes that use off-the-shelf vision backbones.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a deficiency in the basic spatial understanding of current vision-language models (VLMs). Many existing VLMs make errors when handling spatial relationships and perform poorly on tasks such as object localization, counting, and relational question answering. The authors hypothesize that this is mainly because these models adopt pre-trained vision backbones (e.g., Vision Transformers, ViTs) that are trained primarily with image-level supervision and may therefore fail to encode local semantics. To address this, the authors propose a new post-training stage called locality alignment and a new fine-tuning procedure called MaskEmbed. With these methods, they aim to make the vision backbone capture both the local and global semantics of an image, and thereby improve VLM performance on a range of benchmarks, especially those that require spatial understanding.

### Main Contributions:

1. **Introduction of locality alignment**: A post-training stage for ViTs that uses self-supervision so that a backbone trained mainly to encode global, image-level information also encodes local semantics.
2. **Proposal of the MaskEmbed method**: A fine-tuning procedure that uses masked embedding self-consistency to improve the model's local feature extraction without requiring additional annotated data.
3. **Experimental validation**: The authors evaluate locality alignment and MaskEmbed on a vision-only task (patch-level semantic segmentation) and on vision-language benchmarks (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D), and report that locality alignment improves performance on these tasks.
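The masked self-consistency idea behind MaskEmbed (contribution 2) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes one plausible reading of the objective: a frozen teacher embeds the masked image, a student embeds the full image, the student's patch embeddings are masked with the same mask, and a small decoder is asked to match the teacher's output. The names `apply_mask`, `maskembed_loss`, `teacher`, `decoder`, `aligned_student`, and `mixing_student` are hypothetical, and the linear maps stand in for real networks.

```python
import numpy as np

def apply_mask(x, mask):
    """Zero out rows of x (one row per patch) where mask is 0."""
    return x * mask[:, None]

def maskembed_loss(patches, mask, student, teacher, decoder):
    """Toy masked-reconstruction loss (an assumed form, not the paper's code)."""
    # The frozen teacher sees the image with patches masked out.
    target = teacher(apply_mask(patches, mask))
    # The student sees the full image and produces one embedding per patch.
    patch_embeddings = student(patches)
    # The same mask is applied to the embeddings before decoding.
    prediction = decoder(apply_mask(patch_embeddings, mask))
    return float(np.mean((prediction - target) ** 2))

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))
mask = np.array([1.0, 0.0] * (num_patches // 2))  # keep every other patch
W = rng.normal(size=(dim, dim))

# Hypothetical stand-ins: the teacher and decoder pool patches into one vector.
teacher = lambda x: x.mean(axis=0) @ W
decoder = lambda e: e.mean(axis=0) @ W

# A student whose embedding at each position depends only on that patch.
aligned_student = lambda x: x
# A student that smears the global average into every position.
mixing_student = lambda x: np.tile(x.mean(axis=0), (x.shape[0], 1))

aligned_loss = maskembed_loss(patches, mask, aligned_student, teacher, decoder)
mixing_loss = maskembed_loss(patches, mask, mixing_student, teacher, decoder)
print(aligned_loss, mixing_loss)
```

In this toy setting, the student that keeps each patch's content at its own position reconstructs the teacher's masked output exactly, while the student that mixes global information into every position does not. That is the intuition for why such a loss would push embeddings toward per-patch semantics.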
### Key Points of the Solution:

- **Locality alignment**: Uses self-supervised learning to make pre-trained ViTs better capture the local semantics of images.
- **MaskEmbed**: Uses masked inputs and a reconstruction loss to learn the semantic contribution of each image patch, improving the model's local feature extraction.

### Experimental Results:

- **Visual tasks**: On the semantic segmentation task, locality alignment improved the performance of multiple pre-trained models, especially large, high-resolution ones (such as CLIP ViT-L @ 336px and SigLIP SO400M @ 384px).
- **Visual language tasks**: On multiple benchmarks involving spatial understanding, VLMs with locality-aligned backbones performed better, particularly on object localization, text understanding, counting, and relational question answering.

Overall, the paper addresses current VLMs' shortcomings in spatial understanding by introducing locality alignment and MaskEmbed, which complement existing VLM training recipes built on off-the-shelf vision backbones.