VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

2023-08-10

Abstract: Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transferring performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

This paper addresses a weakness of vision-language models such as CLIP: because of the semantic gap within datasets, the pre-trained image-text alignment is sub-optimal on downstream tasks, which seriously harms transfer performance. To better adapt the cross-modal embedding space, the authors propose enhancing CLIP with Visual-guided Texts (VT-CLIP). Specifically, VT-CLIP introduces a visual-guided attention module that lets the features interact after the two encoders, with the text features serving as queries and the image features serving as keys and values. In this way, the text features of different categories can adaptively explore the informative regions of the image and aggregate the relevant visual features according to their attention scores. This makes the texts more semantically correlated with the downstream images, which greatly benefits category-wise matching. In the few-shot setting, the authors evaluate VT-CLIP on 11 well-known classification datasets. The results show that VT-CLIP outperforms the baseline methods on these datasets and performs more stably when only a few samples are available. VT-CLIP also improves consistently across all 11 datasets, which supports the effectiveness and generalization ability of the method.
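
The attention step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections a trained module would have, adds the aggregated visual features back to the text features through a residual connection, and uses the mean of the patch features as a stand-in for CLIP's global image feature. The function names (`visual_guided_texts`, `classify`) and the random inputs are hypothetical and serve only to show the tensor shapes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_guided_texts(text_feats, patch_feats):
    """Cross-attention with texts as queries and image patches as keys/values.

    text_feats:  (K, D) one feature per category (queries).
    patch_feats: (N, D) spatial features of one image (keys and values).
    Returns the visual-guided text features (K, D) and attention map (K, N).
    """
    d = text_feats.shape[-1]
    scores = text_feats @ patch_feats.T / np.sqrt(d)  # (K, N)
    attn = softmax(scores, axis=-1)                   # each row sums to 1
    aggregated = attn @ patch_feats                   # (K, D)
    # Assumption: a residual connection keeps the original text semantics.
    return text_feats + aggregated, attn


def classify(image_feat, guided_texts):
    # Category-wise matching by cosine similarity, as in CLIP-style inference.
    sims = l2_normalize(guided_texts) @ l2_normalize(image_feat)
    return int(np.argmax(sims)), sims


# Hypothetical sizes: 3 categories, 49 patches (7x7 grid), 16-dim features.
rng = np.random.default_rng(0)
K, N, D = 3, 49, 16
text_feats = rng.normal(size=(K, D))
patch_feats = rng.normal(size=(N, D))
image_feat = patch_feats.mean(axis=0)  # assumption: stand-in global feature

guided, attn = visual_guided_texts(text_feats, patch_feats)
pred, sims = classify(image_feat, guided)
```

Each row of `attn` shows which image regions one category's text attends to, and `guided` holds the per-category text features after they have absorbed those regions. Because the guided texts depend on the image, they are recomputed for every input image before matching.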

How Much Can CLIP Benefit Vision-and-Language Tasks?

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Boosting Visual-Language Models by Exploiting Hard Samples

Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Image–Text Matching Model Based on CLIP Bimodal Encoding

The Solution for Language-Enhanced Image New Category Discovery

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

RankCLIP: Ranking-Consistent Language-Image Pretraining

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations