Abstract: Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge from web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively extend them to general video understanding remains an open question. In this paper, we investigate image-to-video transfer from the perspectives of the model and the data, unveiling two key obstacles that impede the adaptation of image-language models: non-generalizable temporal modeling and partially misaligned video-text data. To address these challenges, we propose the Spatial-Temporal Auxiliary Network with a Mutual-guided alignment module (Mug-STAN), a simple yet effective framework that extends image-text models to diverse video tasks and video-text data. Specifically, STAN adopts a branch structure with decomposed spatial-temporal modules to enable generalizable temporal modeling, while Mug suppresses misalignment by introducing token-wise feature aggregation of either modality from the other. Extensive experimental results verify that Mug-STAN significantly improves the adaptation of language-image pretrained models such as CLIP and CoCa at both the video-text post-pretraining and finetuning stages. Our solution achieves state-of-the-art zero-shot and finetuning results on various downstream datasets, including MSR-VTT, DiDeMo, LSMDC, Kinetics-400, Something-Something-2, HMDB-51, UCF-101, and AVA. Moreover, by integrating pretrained Mug-STAN with an emerging multimodal dialogue model, we can realize zero-shot video chatting. Code is available at https://github.com/farewellthree/STAN
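
The abstract describes Mug only at a high level: token-wise feature aggregation of either modality guided by the other. As a rough illustration of that idea, and not the authors' implementation, the sketch below weights each video token by its best match among the text tokens (and vice versa) before pooling. The function names, the max-then-softmax weighting, and the temperature value are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mutual_guided_similarity(video_tokens, text_tokens, temperature=0.07):
    """Hypothetical mutual-guided pooling (illustrative assumption).

    video_tokens: (Nv, D) array of frame or patch features.
    text_tokens:  (Nt, D) array of word features.
    Returns the pooled video-text cosine similarity and both weight vectors.
    """
    v = l2_normalize(video_tokens)
    t = l2_normalize(text_tokens)
    sim = v @ t.T  # (Nv, Nt) token-to-token cosine similarities

    # Each token is weighted by its best match in the other modality,
    # so tokens with no counterpart contribute little to the pooled feature.
    v_weights = softmax(sim.max(axis=1) / temperature, axis=0)  # (Nv,)
    t_weights = softmax(sim.max(axis=0) / temperature, axis=0)  # (Nt,)

    video_feat = l2_normalize(v_weights @ v)  # (D,)
    text_feat = l2_normalize(t_weights @ t)   # (D,)
    return float(video_feat @ text_feat), v_weights, t_weights
```

Under these assumptions, a video token that matches no word in the caption receives a near-zero weight, which is one simple way to down-weight partially misaligned content when computing a video-text score.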

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Text-Video Retrieval with Global-Local Semantic Consistent Learning

Temporal Perceiving Video-Language Pre-training

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Improving CLIP Training with Language Rewrites