Abstract: While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency to the self-attention layers in the final block, and have achieved commendable results by replacing the original query-key attention with self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same category and to improve local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.
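
The idea behind the RCS module, as the abstract describes it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the averaging over intermediate layers, the equal weighting of query-query and key-key attention, and the blending weight `lam` are all assumptions made here for illustration; the abstract only states that intermediate-layer query-key attention is used to remold the final block's attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_correlation_attention(q, k):
    # Standard query-key attention for one head: softmax(q k^T / sqrt(d)).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def self_correlation_attention(q, k):
    # Average of query-query and key-key attention (equal weighting is an assumption).
    d = q.shape[-1]
    qq = softmax(q @ q.T / np.sqrt(d))
    kk = softmax(k @ k.T / np.sqrt(d))
    return 0.5 * (qq + kk)

def residual_attention(final_q, final_k, intermediate_qk, lam=0.5):
    # Hypothetical blend: mean query-key attention from intermediate layers
    # is mixed into the final block's self-correlation attention.
    inter = np.mean(
        [cross_correlation_attention(q, k) for q, k in intermediate_qk], axis=0
    )
    final = self_correlation_attention(final_q, final_k)
    return (1.0 - lam) * final + lam * inter

# Toy usage with random tensors standing in for patch-token queries and keys.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
layers = [
    (rng.standard_normal((n_tokens, dim)), rng.standard_normal((n_tokens, dim)))
    for _ in range(3)
]
q_f = rng.standard_normal((n_tokens, dim))
k_f = rng.standard_normal((n_tokens, dim))
attn = residual_attention(q_f, k_f, layers, lam=0.5)
```

Because the result is a convex combination of row-normalized attention maps, each row of `attn` still sums to one, so it can be applied to the value tokens in the usual way.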

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes

How Much Can CLIP Benefit Vision-and-Language Tasks?

CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Selective Vision-Language Subspace Projection for Few-shot CLIP

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Localized Language-Image Pre-Training

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Improving CLIP Training with Language Rewrites

Non-Contrastive Learning Meets Language-Image Pre-Training

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

CLIPVQA: Video Quality Assessment via CLIP