Abstract: Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation and editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI.

What problem does this paper attempt to address?

The paper addresses a limitation of Vision-Language (VL) models in zero-shot action recognition and proposes a way to overcome it. Large-scale VL models such as CLIP align image and text representations well and have driven significant progress in tasks such as zero-shot recognition, but they tend to over-represent objects (nouns) and pay too little attention to actions (verbs or verb phrases). As a result, they perform suboptimally on zero-shot action recognition in video unless they are additionally tuned on video data.

To address this, the paper proposes "MAtch, eXpand and Improve" (MAXI), which adapts a VL model using only unannotated videos and a set of language resources. MAXI proceeds in three steps:

1. **Match**: Use an existing VL model (e.g., CLIP) to match each unannotated video against the entries of a predefined action dictionary and keep the most relevant text descriptions.
2. **Expand**: Use a large language model (e.g., GPT-3) and an image-text captioning model (e.g., BLIP) to generate further descriptions of the video content, expanding the originally matched text.
3. **Improve**: Collect the matched and generated texts into a "text bag" for each video and fine-tune the VL model on these bags with a Multiple Instance Learning (MIL) objective, improving zero-shot recognition of unseen action categories.

The key innovation of MAXI is that this whole process fine-tunes the VL model without any annotated data, while still improving performance on multiple downstream zero-shot action recognition tasks.
Experimental results show that MAXI improves the performance of the base VL model by up to 14% and compares favorably with fully supervised baselines in both zero-shot and few-shot action recognition transfer.
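
The matching step and the MIL objective described above can be illustrated with a minimal NumPy sketch. It assumes video and text embeddings have already been produced by some VL encoder; the function names (`match_actions`, `mil_nce_loss`), the temperature value, and the particular MIL-NCE-style formulation are illustrative assumptions and not taken from the paper or its released code, whose exact loss may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def match_actions(video_emb, dict_emb, top_k=1):
    """Match step (sketch): for each video, return the indices of the top_k
    action-dictionary entries with the highest cosine similarity.

    video_emb: (V, d) video embeddings; dict_emb: (D, d) text embeddings.
    """
    sims = l2_normalize(video_emb) @ l2_normalize(dict_emb).T  # (V, D)
    return np.argsort(-sims, axis=1)[:, :top_k]

def mil_nce_loss(video_emb, bag_emb, bag_owner, temperature=0.07):
    """MIL-NCE-style loss (assumed form): each video should score the texts in
    its own bag higher than the texts belonging to other videos' bags.

    video_emb: (V, d); bag_emb: (T, d) all bag texts in the batch;
    bag_owner: (T,) index of the video whose bag contains each text.
    Every video needs at least one text in its bag.
    """
    logits = l2_normalize(video_emb) @ l2_normalize(bag_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    # positive[v, t] is True when text t belongs to the bag of video v
    positive = bag_owner[None, :] == np.arange(video_emb.shape[0])[:, None]
    pos_sum = (exp * positive).sum(axis=1)  # mass on the video's own bag
    all_sum = exp.sum(axis=1)               # mass on every text in the batch
    return float(-np.log(pos_sum / all_sum).mean())
```

Summing over all texts in a bag, rather than requiring every single text to match, is what makes the objective tolerant of noisy entries: a bag only needs some of its texts to describe the video well for the loss to be low.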

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
