Contrastive Localized Language-Image Pre-Training

Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan
2024-10-04
Abstract: Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses a limitation of existing Contrastive Language-Image Pre-training (CLIP) models: they perform well at global image understanding but fall short on tasks that need fine-grained visual representations, especially in multimodal large language models (MLLMs) that require region-level understanding. CLIP aligns whole images with text through contrastive learning, so the alignment emphasizes global semantics and is too coarse for tasks that depend on specific image regions.

To address this, the authors propose a new pre-training method, Contrastive Localized Language-Image Pre-training (CLOC). CLOC improves the localization ability of CLIP by adding a region-text contrastive loss and supporting modules. It also introduces the concept of promptable embeddings: the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, the authors design a visually-enriched and spatially-localized captioning framework that generates region-text pseudo-labels at scale.

In summary, the paper aims to improve CLIP's localization ability, particularly its understanding of image regions when used in MLLMs, so that it better supports tasks needing fine-grained visual representations, such as object classification and region retrieval.

### Main contributions:

1. **Proposing promptable embeddings**: Defines a new learning objective: a strong visual encoder should produce image embeddings that are easy to convert into region representations given spatial hints (such as box references or text prompts).
2. **Designing simple modifications**: Adds a region-text contrastive loss on top of CLIP and uses a lightweight extraction module to derive region embeddings from image embeddings.
3. **Large-scale pseudo-labeled data engine**: Combines a visually-enriched image captioner with an open-vocabulary detector to build a billion-scale image-text dataset with fine-grained region-text labels.
4. **Experimental verification**: Extensive experiments show that CLOC consistently outperforms CLIP across multiple evaluation tasks, especially those involving MLLMs. CLOC retains CLIP's strong image-level knowledge while improving performance on fine-grained visual tasks.
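
The second contribution can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `box_pool` function below is a hypothetical stand-in that simply averages patch embeddings inside a box, whereas the paper describes a learned lightweight extraction module. The function names, the row-major patch grid, the normalized `(x0, y0, x1, y1)` box format, and the temperature of 0.07 are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def box_pool(patch_emb, box, grid):
    """Hypothetical stand-in for a region extraction module.

    Averages the patch embeddings whose centers fall inside a box.
    patch_emb: (grid * grid, d), patches assumed in row-major order.
    box: (x0, y0, x1, y1) with coordinates normalized to [0, 1].
    """
    centers = (np.arange(grid) + 0.5) / grid
    cx, cy = np.meshgrid(centers, centers)  # cx varies along columns, cy along rows
    x0, y0, x1, y1 = box
    mask = ((cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)).reshape(-1)
    if not mask.any():
        # Box smaller than one patch: fall back to the nearest patch center.
        dist = (cx - (x0 + x1) / 2) ** 2 + (cy - (y0 + y1) / 2) ** 2
        mask = dist.reshape(-1) == dist.min()
    return patch_emb[mask].mean(axis=0)

def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style symmetric cross-entropy; row i of `a` pairs with row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy data: one image with a 4x4 patch grid and three boxed regions.
rng = np.random.default_rng(0)
grid, dim = 4, 8
patch_emb = rng.normal(size=(grid * grid, dim))
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.0, 1.0, 0.5), (0.0, 0.5, 1.0, 1.0)]
region_emb = np.stack([box_pool(patch_emb, b, grid) for b in boxes])

# Random vectors standing in for text-encoder outputs of the region captions.
text_emb = rng.normal(size=(len(boxes), dim))

loss_random = symmetric_contrastive_loss(region_emb, text_emb)
loss_aligned = symmetric_contrastive_loss(region_emb, region_emb)
```

In this sketch the region-text loss has the same form as CLIP's image-text loss, only applied to pooled region embeddings and their region captions. During pre-training it would be added to the image-level loss, which is one way to read the summary's claim that CLOC keeps CLIP's image-level knowledge while gaining region-level alignment.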