VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang
2023-08-10
Abstract: Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transferring performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses a weakness of vision-language models such as CLIP: because of the semantic gap within datasets, the pre-trained image-text alignment is sub-optimal on downstream tasks, which seriously harms transfer performance. To better adapt the cross-modal embedding space, the authors propose enhancing CLIP with Visual-guided Texts (VT-CLIP). Specifically, VT-CLIP introduces a visual-guided attention module that lets features communicate after the two encoders, with the text serving as queries and the image serving as keys and values. In this way, the text of each category can adaptively explore the informative regions of the image and aggregate the relevant visual features according to its attention scores. This makes the text more semantically correlated with the downstream image, which greatly facilitates category-wise matching. In the few-shot setting, the authors evaluate VT-CLIP on 11 well-known classification datasets. The results show that VT-CLIP outperforms the other baseline methods on these datasets. In particular, its performance remains stable and strong when the number of samples is small. Moreover, VT-CLIP shows consistent superiority across all 11 datasets, demonstrating the method's effectiveness and generalization ability.
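The attention step described above, with text as queries and image features as keys and values, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional features, the residual addition of the pooled visual features to the text feature, and the omission of learned projection layers are all assumptions made to keep the example short.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def visual_guided_texts(text_feats, region_feats):
    """Cross-attention: each class text feature is a query; image region
    features are keys and values. Hypothetical helper, no learned weights."""
    dim = len(text_feats[0])
    guided, attn_maps = [], []
    for query in text_feats:
        # Scaled dot-product score between this class text and every region.
        scores = [dot(query, key) / math.sqrt(dim) for key in region_feats]
        weights = softmax(scores)
        # Aggregate region features according to the attention weights.
        pooled = [
            sum(w * region[d] for w, region in zip(weights, region_feats))
            for d in range(dim)
        ]
        # Assumption: add the pooled visual feature back onto the text feature.
        guided.append([t + p for t, p in zip(query, pooled)])
        attn_maps.append(weights)
    return guided, attn_maps


def classify(global_feat, guided_texts):
    """Pick the class whose visual-guided text is most cosine-similar
    to the global image feature."""
    image = normalize(global_feat)
    sims = [dot(image, normalize(t)) for t in guided_texts]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy data (made up): two class texts and three image regions.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_feats = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [0.1, 0.2, 0.9]]
# Assumption: the global image feature is the mean of the region features.
global_feat = [
    sum(region[d] for region in region_feats) / len(region_feats)
    for d in range(3)
]

guided, attn = visual_guided_texts(text_feats, region_feats)
pred, sims = classify(global_feat, guided)
print("predicted class index:", pred)
```

In this toy setup the first class text attends mostly to the two regions that resemble it, so its visual-guided version moves toward the image content. A trained model would additionally learn projection weights for the queries, keys, and values, which this sketch leaves out.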