CoPL:Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks

Yihan Zhao,Wei Xi,Yuhang Cui,Gairui Bai,Xinhui Liu,Jizhong Zhao
DOI: https://doi.org/10.1145/3664647.3681492
2024-01-01
Abstract:Parameter-Efficient Fine Tuning (PEFT) has been demonstrated to be effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps alleviate substantial computational costs for retraining multi-modal models. However, existing approaches primarily focus on multi-modal fusion, while neglecting the modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient Collaborative Prompt Learning (CoPL) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored for fine-tuning each modality, while the modal-interaction prompts are customized to explore inter-modality association. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while utilizing approximately 1% extra trainable parameters.
What problem does this paper attempt to address?