Abstract: Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, training and testing of FC-CLIP are 7.5x and 6.6x faster, respectively, than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets.
Code at https://github.com/bytedance/fc-clip
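
The abstract describes the single-stage idea only at a high level: one frozen backbone produces features that serve both mask generation and open-vocabulary classification. The following is a minimal, illustrative sketch of one way a predicted mask could be classified against text embeddings from features that are computed once. It is not taken from the FC-CLIP code. The mask-pooling step, the cosine-similarity scoring, and every name in it (`mask_pool`, `cosine`, `classify_mask`, the toy `features` and `text_embeddings`) are assumptions made for illustration, and the tiny hand-written vectors stand in for real CLIP features.

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors at pixels where the mask is on.

    features: H x W x C nested lists (hypothetical backbone output).
    mask:     H x W nested lists of 0/1 (hypothetical predicted mask).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, on in zip(feat_row, mask_row):
            if on:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def classify_mask(features, mask, text_embeddings):
    """Pick the category whose text embedding best matches the pooled feature."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins: a 2 x 2 feature map with 2 channels, computed once and reused.
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]
# Hypothetical text embeddings for an arbitrary, user-supplied vocabulary.
text_embeddings = {"sky": [1.0, 0.0], "road": [0.0, 1.0]}

top_label, _ = classify_mask(features, top_mask, text_embeddings)
bottom_label, _ = classify_mask(features, bottom_mask, text_embeddings)
print(top_label, bottom_label)
```

In this sketch the same `features` array is reused for every mask, which is the point the abstract makes about avoiding repeated feature extraction; the vocabulary can be changed at test time simply by passing a different `text_embeddings` dictionary.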

FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Frozen CLIP Models are Efficient Video Learners

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

ActionCLIP: A New Paradigm for Video Action Recognition

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model Via Interpolated Weight Optimization

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition

Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning

CLIP-guided Prototype Modulating for Few-shot Action Recognition

MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Fine-grained Knowledge Graph-driven Video-Language Learning for Action Recognition

FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs