Adaptively Building a Video-language Model for Video Captioning and Retrieval Without Massive Video Pretraining

Zihao Liu, Xiaoyu Wu, Shengjin Wang, Jiayao Qian
DOI: https://doi.org/10.1145/3664647.3680778
2024-01-01
Abstract: Large-scale pretrained image-language models have recently shown remarkable performance. However, building a video-language model is more challenging because video is more complex and high-quality data is difficult to collect. This paper builds a video-language model in an adaptive manner: it transfers knowledge from the image domain and can achieve state-of-the-art performance without any further massive video pretraining. The main contributions are a Visual Perception Adapter, which seamlessly and efficiently adapts a pretrained image-language model to the video domain, and fine-grained contrastive learning with Inter-modal Token Alignment, which bridges the semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and retrieval. Experiments demonstrate that it performs competitively with models pretrained on millions of video-text pairs. Notably, the model's CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state of the art by 6.3% and 1.3%, respectively.