CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
DOI: https://doi.org/10.1016/j.neucom.2022.07.028
IF: 6
2022-10-07
Neurocomputing
Abstract: Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems in multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model, which transfers the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner. We further conduct several empirical studies that ask: 1) Are image features enough for video-text retrieval and captioning? 2) How does post-pretraining CLIP on a large-scale video-text dataset affect performance? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art (SOTA) results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for multimodal understanding and generation tasks.
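To make the first and third questions concrete, the following is a minimal sketch of how per-frame image features might be turned into a single video embedding and scored against a text embedding. It assumes a parameter-free mean pooling over frames and cosine similarity as the matching score; the abstract does not specify either choice, so both are illustrative assumptions. The function names (`mean_pool`, `video_text_similarity`, `rank_videos`) and the toy two-dimensional vectors are hypothetical, and no real CLIP encoder is involved.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    Assumption: a parameter-free average that ignores frame order,
    i.e. no temporal dependency is modeled at all.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def video_text_similarity(frame_features, text_feature):
    """Cosine similarity between the pooled video vector and a text vector."""
    frames = [l2_normalize(f) for f in frame_features]
    video = l2_normalize(mean_pool(frames))
    text = l2_normalize(text_feature)
    return sum(v * t for v, t in zip(video, text))


def rank_videos(videos, text_feature):
    """Return video names sorted from most to least similar to the text."""
    scores = {
        name: video_text_similarity(frames, text_feature)
        for name, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for encoder outputs (hypothetical values, two frames per video).
videos = {
    "video_a": [[1.0, 0.0], [0.8, 0.2]],
    "video_b": [[0.0, 1.0], [0.1, 0.9]],
}
query = [1.0, 0.1]
ranking = rank_videos(videos, query)
```

Because the average discards frame order, a sketch like this corresponds to the "image features only" baseline implied by question 1; a mechanism that models temporal dependency (question 3) would replace `mean_pool` with an order-aware aggregator.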
computer science, artificial intelligence
What problem does this paper attempt to address?