Abstract: Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes. It outperforms the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Additionally, FC-CLIP trains 7.5x faster and tests 6.6x faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets.
Code at https://github.com/bytedance/fc-clip
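
The abstract describes classifying predicted masks with image and text features that live in a shared embedding space. A minimal sketch of one common way to do this, mask pooling followed by cosine similarity against text embeddings, is given below. It is an illustration under stated assumptions, not the FC-CLIP implementation: the function names (`mask_pool`, `classify_mask`), the toy feature map, the two-dimensional "text embeddings", and the temperature of 0.07 are all hypothetical, and the abstract does not confirm that FC-CLIP classifies masks in exactly this way.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    features: H x W x C nested lists (stand-in for frozen backbone features).
    mask:     H x W nested lists of 0/1 (stand-in for one predicted mask).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, text_embeddings, temperature=0.07):
    """Return a softmax distribution over category names for one mask.

    text_embeddings maps a category name to its embedding vector.
    The temperature value is an assumption chosen for illustration.
    """
    pooled = mask_pool(features, mask)
    names = list(text_embeddings)
    logits = [cosine(pooled, text_embeddings[n]) / temperature for n in names]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    norm = sum(exps)
    return {name: e / norm for name, e in zip(names, exps)}


# Toy 2 x 2 feature map with 2 channels (made-up values).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]

# Made-up "text embeddings" for two category names.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

top_probs = classify_mask(features, top_mask, text_embeddings)
bottom_probs = classify_mask(features, bottom_mask, text_embeddings)
print(max(top_probs, key=top_probs.get))
print(max(bottom_probs, key=bottom_probs.get))
```

Because the category set enters only through `text_embeddings`, new categories can be supplied at test time without changing the pooling step, which is the sense in which a shared embedding space supports an open vocabulary.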

Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond

Iterative Few-shot Semantic Segmentation from Image Label Text

CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning

ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

A Joint Framework Towards Class-aware and Class-agnostic Alignment for Few-shot Segmentation

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation