MPT: Multi-grained Prompt Tuning for Text-Video Retrieval
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
DOI: https://doi.org/10.1145/3664647.3680839
2024-01-01
Abstract: Recently, significant advances have been made in text-video retrieval by transferring large-scale image-text pre-trained models through model adaptation, i.e., full fine-tuning or prompt tuning, a parameter-efficient fine-tuning strategy. While full fine-tuning incurs high computational costs, particularly as model size increases, prompt tuning offers greater flexibility and efficiency by adjusting only a few learnable parameters. However, current prompt tuning methods rely on coarse visual and textual cues for the text-video retrieval task, neglecting domain-specific features when performing the adaptation. This may lead to sub-optimal performance due to the incorporation of irrelevant and indiscriminate knowledge. To address this issue, we present Multi-grained Prompt Tuning (MPT) for text-video retrieval, which designs a variety of specific prompts to effectively explore semantic interactions across different modalities at diverse granularities. Specifically, we devise a multi-grained video encoder that employs spatial, temporal, and global prompts to transfer the base-generic knowledge from the image-text pre-trained model while comprehensively excavating determinative video-specific characteristics. Meanwhile, we introduce a novel multi-grained text encoder that captures various levels of textual clues through word and phrase prompts. Extensive experiments on four benchmark datasets, i.e., MSR-VTT, ActivityNet, DiDeMo, and LSMDC, demonstrate that MPT achieves outstanding performance, surpassing state-of-the-art methods with negligible computational cost. The codebase is publicly available at: https://github.com/zchoi/MPT.
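To make the idea of "adjusting only a few learnable parameters" concrete, the following is a minimal, generic sketch of prompt tuning in plain Python: a handful of learnable prompt vectors are prepended to the token sequence of a frozen encoder, and only those vectors would be updated during training. It is not the MPT implementation. The abstract does not specify how the spatial, temporal, global, word, or phrase prompts are inserted, so the function names (`make_prompts`, `prepend_prompts`, `trainable_fraction`), the prepending scheme, and all sizes below are illustrative assumptions.

```python
import random


def make_prompts(num_prompts, dim, seed=0):
    """Create small random prompt vectors (hypothetical initialization).

    In prompt tuning these are the only values that receive gradient updates.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]


def prepend_prompts(prompts, tokens):
    """Return a new sequence: prompt vectors followed by the input tokens.

    The input tokens come from the frozen backbone's embedding layer and
    are left unchanged.
    """
    return [list(p) for p in prompts] + [list(t) for t in tokens]


def count_params(vectors):
    """Count scalar parameters in a list of vectors."""
    return sum(len(v) for v in vectors)


def trainable_fraction(num_prompt_params, num_backbone_params):
    """Fraction of all parameters that are trainable under prompt tuning."""
    return num_prompt_params / (num_prompt_params + num_backbone_params)


# Hypothetical sizes chosen only for illustration.
dim = 8
tokens = [[0.0] * dim for _ in range(5)]   # 5 frozen input tokens
prompts = make_prompts(4, dim)             # 4 learnable prompt tokens

sequence = prepend_prompts(prompts, tokens)
prompt_params = count_params(prompts)      # 4 * 8 = 32 trainable scalars
fraction = trainable_fraction(prompt_params, 968)  # assumed backbone size
```

With these toy numbers the encoder sees 9 tokens instead of 5, and 32 of 1,000 parameters are trainable. A multi-grained variant would maintain several such prompt sets, one per granularity, but how they are combined is left to the paper and the linked codebase.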