Alignment and Generation Adapter for Efficient Video-text Understanding

Han Fang, Zhifei Yang, Yuhan Wei, Xianghao Zang, Chao Ban, Zerun Feng, Zhongjiang He, Yongxiang Li, Hao Sun
DOI: https://doi.org/10.1109/iccvw60793.2023.00296
2023-01-01
Abstract: Pre-trained models have demonstrated strong performance, especially in enhancing cross-modal understanding between videos and text. However, fine-tuning them at scale is costly and makes adaptation to diverse downstream tasks difficult. To tackle these challenges, we propose the Alignment-generation Adapter (AGAdapter), which establishes semantic coherence between alignment and generation models for efficient video-text adaptation across multiple tasks simultaneously. We propose an alignment adapter with knowledge sharing to adapt the frozen CLIP model for fine-grained video-language interaction. Additionally, we introduce a generation adapter with prompt tuning to leverage a large language model for captioning. Furthermore, we introduce instruction joint tuning, which combines textual and cross-modal instructions, to capture detailed descriptions. Our AGAdapter achieves state-of-the-art performance on video-text retrieval and video captioning on two benchmarks, MSR-VTT and ActivityNet.