FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
2024-02-06
Abstract: In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory performance when dealing with unseen actions.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?