ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

2024-08-09

Abstract: Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
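For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how mean Intersection over Union can be computed for a single label map. The function name `mean_iou` and the toy labels are illustrative assumptions, not taken from the paper; benchmark toolkits typically accumulate intersections and unions over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two flat lists of per-pixel class ids.

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

The "average mIoU across eight benchmarks" reported above would then be the plain mean of eight such per-benchmark scores.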

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses three challenges in open-vocabulary semantic segmentation:

1. **Limitations of the CLIP Model**: Although the Contrastive Language-Image Pre-training (CLIP) model excels at recognizing visual concepts, its limited localization ability makes it struggle to produce coherent segments in dense prediction tasks.
2. **Strengths and Weaknesses of Vision Foundation Models**: Vision Foundation Models (VFMs), such as self-supervised models and the Segment Anything Model (SAM), produce spatially consistent local visual representations, but they lack semantic understanding and usually require fine-tuning on downstream tasks.
3. **Combining the Strengths of CLIP and VFMs**: The paper proposes ProxyCLIP, a framework that merges the strengths of CLIP and VFMs for open-vocabulary semantic segmentation. ProxyCLIP uses the spatial feature correspondences of VFMs as proxy attention for CLIP, thereby inheriting the robust local consistency of VFMs while keeping the zero-shot transfer capability of CLIP.

As a training-free approach, ProxyCLIP raises the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, bridging spatial precision and semantic richness in open-vocabulary segmentation.
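The proxy-attention idea described above can be illustrated with a small, dependency-free sketch. Everything specific here is an assumption made for illustration: the summary does not give the exact normalization or masking rule, so the sketch shifts the VFM cosine similarities by a scaled global mean (`beta`), scales them (`gamma`), masks non-positive entries, and applies a softmax over what remains. The resulting weights are then applied to CLIP value vectors. The function names, the default values of `beta` and `gamma`, and the fallback to self-attention when a row is fully masked are all hypothetical choices, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def proxy_attention(vfm_feats, clip_values, beta=1.2, gamma=3.0):
    """Aggregate CLIP value vectors using attention derived from VFM features.

    vfm_feats:   N patch features from a VFM (list of lists)
    clip_values: N CLIP value vectors, one per patch (list of lists)
    beta, gamma: assumed shift and scale hyperparameters (illustrative values)
    """
    n = len(vfm_feats)
    dim = len(clip_values[0])
    unit = [l2_normalize(f) for f in vfm_feats]
    # Pairwise cosine similarity between VFM patch features
    sim = [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
           for i in range(n)]
    mean_sim = sum(map(sum, sim)) / (n * n)

    refined = []
    for i in range(n):
        # Assumed normalization: shift by the scaled mean, then scale
        logits = [gamma * (s - beta * mean_sim) for s in sim[i]]
        # Assumed masking: keep only patches with positive normalized similarity;
        # fall back to the patch itself if nothing survives (guard added here)
        kept = [j for j in range(n) if logits[j] > 0] or [i]
        peak = max(logits[j] for j in kept)
        weights = {j: math.exp(logits[j] - peak) for j in kept}
        total = sum(weights.values())
        # Softmax-weighted average of the CLIP value vectors
        refined.append([
            sum(w * clip_values[j][d] for j, w in weights.items()) / total
            for d in range(dim)
        ])
    return refined


# Toy input: patches 0-1 and patches 2-3 form two regions in VFM feature space
vfm = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clip_v = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
refined = proxy_attention(vfm, clip_v)
```

In this toy input each patch ends up aggregating CLIP values only from patches in its own region, which is the local-consistency effect the summary attributes to VFMs. A practical version would use batched tensor operations rather than Python lists.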


CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want