FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Jiedong Zhuang, Jiaqi Hu, Lianrui Mu, Rui Hu, Xiaoyu Liang, Jiangnan Ye, Haoji Hu
2024-08-21
Abstract: CLIP has achieved impressive zero-shot performance after pre-training on a large-scale dataset consisting of paired image-text data. Previous works have utilized CLIP by incorporating manually designed visual prompts like colored circles and blur masks into the images to guide the model's attention, showing enhanced zero-shot performance in downstream tasks. Although these methods have achieved promising results, they inevitably alter the original information of the images, which can lead to failure in specific tasks. We propose a training-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can augment current methods to enhance their performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing visual prompting methods for CLIP: to improve zero-shot performance, they draw prompts such as colored circles or blur masks directly onto the image, which inevitably alters the original image information and can reduce accuracy on specific tasks. The authors propose a training-free method, FALIP (Foveal-Attention CLIP). It inserts a foveal attention mask into the multi-head self-attention module to adjust CLIP's attention, improving zero-shot performance without changing the original image content. The main contributions of the paper are as follows:

1. **Proposing FALIP**: A new method that adaptively guides CLIP's attention at inference time without additional training.
2. **Extensive evaluation**: FALIP is evaluated on multiple tasks and datasets and shows competitive performance compared to existing methods.
3. **In-depth analysis**: The paper examines why visual prompts are effective and offers new insights for improving CLIP's zero-shot inference.
4. **Discovering attention-head sensitivity**: Different attention heads are sensitive to visual prompts to different degrees, and adjusting these heads can further unlock the potential of visual prompts.

Through these contributions, the paper avoids the image alteration required by existing methods and offers new directions and tools for future research.
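
The idea of steering attention with a mask rather than by editing pixels can be illustrated with a small sketch. This is not the authors' implementation: the mask shape (a constant inside the region of interest with Gaussian decay outside), the choice to bias only the class-token query, and the names `foveal_mask`, `cls_attention`, `sigma`, and `strength` are all assumptions made for illustration. The sketch uses a single attention head and plain Python so that it runs without dependencies.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def foveal_mask(grid, box, sigma=1.0, strength=2.0):
    """Additive attention bias for each patch in a grid x grid layout.

    box = (row0, col0, row1, col1), inclusive patch indices of the region
    of interest. The bias equals `strength` inside the box and decays like
    a Gaussian with distance from it (an assumed shape, for illustration).
    """
    r0, c0, r1, c1 = box
    bias = []
    for r in range(grid):
        for c in range(grid):
            dr = max(r0 - r, 0, r - r1)  # vertical distance to the box
            dc = max(c0 - c, 0, c - c1)  # horizontal distance to the box
            d2 = dr * dr + dc * dc
            bias.append(strength * math.exp(-d2 / (2 * sigma * sigma)))
    return bias


def cls_attention(q_cls, keys, patch_bias=None):
    """Attention weights of the class-token query over [CLS] + patch keys.

    The bias is added to the attention logits before the softmax, so the
    image itself is never modified. Token 0 (CLS) receives no bias.
    """
    d = len(q_cls)
    logits = [sum(a * b for a, b in zip(q_cls, k)) / math.sqrt(d) for k in keys]
    if patch_bias is not None:
        logits = [logits[0]] + [l + b for l, b in zip(logits[1:], patch_bias)]
    return softmax(logits)


# Toy example: 1 CLS token + a 3x3 patch grid with identical keys,
# so unbiased attention is uniform across all 10 tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0] for _ in range(10)]

baseline = cls_attention(query, keys)
focused = cls_attention(query, keys, foveal_mask(grid=3, box=(1, 1, 1, 1)))
```

In this toy setup, `baseline` is uniform, while `focused` places the largest weight on the centre patch (token index 5) and smoothly less on its neighbours. A real CLIP vision transformer would apply such a bias per head and per layer, which is where the head-sensitivity finding in contribution 4 becomes relevant.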