Abstract: Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.

What problem does this paper attempt to address?

This paper addresses a limitation of existing Contrastive Language-Image Pre-training (CLIP) models: they excel at global image understanding but fall short on tasks that require fine-grained visual representations, especially in Multimodal Large Language Models (MLLMs) that need region-level understanding. Specifically, CLIP aligns an entire image with its text through contrastive learning, so the alignment emphasizes global semantics and is too coarse for tasks that depend on specific image regions.

To address this, the authors propose a new pre-training method, Contrastive Localized Language-Image Pre-training (CLOC). CLOC strengthens the localization ability of CLIP by adding a region-text contrastive loss and supporting modules. It also introduces the concept of promptable embeddings: the encoder produces image embeddings that are easy to transform into region representations given spatial prompts. To support large-scale pre-training, the authors design a visually enriched and spatially localized captioning framework that generates region-text pseudo-labels at scale.

In summary, the paper aims to improve the localization ability of CLIP, and in particular its understanding of image regions within MLLMs, so that it better supports tasks requiring fine-grained visual representations, such as object classification and region retrieval.

### Main contributions:

1. **Proposing promptable embeddings**: Defines a new learning objective: a strong visual encoder should produce image embeddings that are easy to convert into region representations given spatial prompts (such as box references or text prompts).
2. **Designing simple modifications**: Adds a region-text contrastive loss on top of CLIP and uses a lightweight extraction module to extract region embeddings from image embeddings.
3. **Large-scale pseudo-labeled data engine**: Combines a visually enriched image captioner with an open-vocabulary detector to generate a billion-scale image-text dataset with fine-grained region-text labels.
4. **Experimental verification**: Extensive experiments show that CLOC significantly and consistently outperforms CLIP on multiple evaluation tasks, especially those involving MLLMs.

These improvements enable CLOC to retain the strong image-level knowledge of CLIP while improving its performance on fine-grained visual tasks.
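The two ideas in the summary above, extracting a region embedding from image embeddings given a box prompt and training it with a region-text contrastive loss, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not describe the internals of the lightweight extraction module, so plain mean pooling over the patches inside the box stands in for it, and the names `box_mask`, `region_embedding`, and `symmetric_contrastive_loss`, the normalized box format, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import numpy as np


def box_mask(grid_h, grid_w, box):
    """Boolean (grid_h, grid_w) mask of patches whose centre lies in the box.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] image coordinates
    (an assumed format for this sketch).
    """
    x0, y0, x1, y1 = box
    ys = (np.arange(grid_h) + 0.5) / grid_h  # patch-centre y coordinates
    xs = (np.arange(grid_w) + 0.5) / grid_w  # patch-centre x coordinates
    inside_y = (ys >= y0) & (ys <= y1)
    inside_x = (xs >= x0) & (xs <= x1)
    return np.outer(inside_y, inside_x)


def region_embedding(patch_emb, box):
    """Mean-pool patch embeddings inside the box.

    patch_emb has shape (grid_h, grid_w, dim). Mean pooling is a stand-in
    for the lightweight extraction module, chosen for simplicity.
    """
    grid_h, grid_w, _ = patch_emb.shape
    mask = box_mask(grid_h, grid_w, box)
    if not mask.any():
        raise ValueError("box covers no patch centre")
    return patch_emb[mask].mean(axis=0)


def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Subtract the max first for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over matched (region, text) rows.

    region_emb and text_emb have shape (n, dim); row i of each is a pair.
    """
    r = l2_normalize(region_emb)
    t = l2_normalize(text_emb)
    logits = r @ t.T / temperature  # (n, n) cosine similarities, scaled
    idx = np.arange(len(logits))
    loss_r2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2r = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_r2t + loss_t2r)


# Toy usage: a 4x4 patch grid with 8-dim embeddings and two box prompts.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 8))
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
regions = np.stack([region_embedding(patches, b) for b in boxes])

# Using the region vectors themselves as stand-in "text" embeddings:
# correctly matched pairs should give a lower loss than swapped pairs.
loss_matched = symmetric_contrastive_loss(regions, regions)
loss_swapped = symmetric_contrastive_loss(regions, regions[::-1])
```

In this toy setup the same patch embeddings serve every box prompt, which mirrors the intent of promptable embeddings: the image is encoded once and region representations are derived from it afterwards. A learned module would replace the mean pooling, and real text embeddings would replace the stand-in vectors.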

Contrastive Localized Language-Image Pre-Training

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Non-Contrastive Learning Meets Language-Image Pre-Training

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Improving CLIP Training with Language Rewrites

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

How Much Can CLIP Benefit Vision-and-Language Tasks?

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

LocCa: Visual Pretraining with Location-aware Captioners

Multilingual Vision-Language Pre-training for the Remote Sensing Domain