Abstract: Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier capable of identifying novel actions and events at test time. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets, respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance while using substantially less fine-tuning data than other methods. Code is released at https://github.com/wengzejia1/Open-VCLIP.
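
The abstract names weight interpolation as the core of Interpolated Weight Optimization but does not spell out the procedure. As a rough illustration only, the sketch below shows plain linear interpolation between two sets of weights with matching parameter names. The function name `interpolate_weights`, the coefficient `lam`, and the dictionary-of-lists weight format are assumptions made for this example; this is not the authors' implementation, and how the interpolation is scheduled during training and testing is not described in the abstract.

```python
from typing import Dict, List

# Hypothetical weight format for illustration: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * pretrained + lam * finetuned, parameter by parameter.

    lam = 0.0 recovers the pretrained weights, lam = 1.0 the fine-tuned ones.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both weight sets must have the same parameter names")

    mixed: Weights = {}
    for name, p_values in pretrained.items():
        f_values = finetuned[name]
        if len(p_values) != len(f_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [(1.0 - lam) * p + lam * f for p, f in zip(p_values, f_values)]
    return mixed


if __name__ == "__main__":
    # Toy weights standing in for an image-pretrained and a video-fine-tuned model.
    clip_like = {"proj": [0.0, 2.0], "bias": [4.0]}
    video_like = {"proj": [2.0, 4.0], "bias": [0.0]}
    print(interpolate_weights(clip_like, video_like, 0.5))
```

Intuitively, a smaller `lam` keeps the result closer to the original image-pretrained weights, while a larger `lam` moves it toward the video-specialized weights.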