MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Yongzhu Miao, Shasha Li, Jintao Tang, Ting Wang
DOI: https://doi.org/10.1109/ICME55011.2023.00013
2024-07-14
Abstract: Prompt tuning, like CoOp, has recently shown promising vision recognizing and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed as MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot vision recognition and out-of-domain generalization tasks. Compared with the state-of-the-art methods, MuDPT achieves better recognition and generalization ability with an apparent margin thanks to synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses a weakness of existing uni-modal prompt tuning methods: when applied to large-scale pre-trained vision-language models such as CLIP, they break the original alignment between textual and visual representations, which leads to sub-optimal performance on downstream tasks. Specifically:

1. **Limitations of Uni-modal Prompt Tuning**: Existing uni-modal methods, which tune prompts in only the text branch or only the image branch, do not fully exploit the alignment of textual and visual representations already established in the pre-trained model. Tuning a single modality disrupts that alignment and degrades downstream performance.
2. **Need for Cross-Modal Fusion**: To adapt better to visual recognition tasks, a method should tune text and visual prompts simultaneously, so that the consistency and synergy of the multi-modal representations are maintained and strengthened.

To address these problems, the authors propose **Multi-modal Deep-symphysis Prompt Tuning (MuDPT)**. MuDPT achieves deep bi-directional fusion between text and visual prompts by introducing a lightweight, modality-independent transformation network (the Injection Model). This preserves the alignment already present in the pre-trained model and further strengthens the synergy between textual and visual representations, improving performance on few-shot visual recognition and out-of-domain generalization.

### Specific Improvement Points

- **Multi-modal Prompt Fusion**: MuDPT introduces both text and visual prompts and fuses them in both directions through the Injection Model.
- **Deep Bi-directional Fusion**: Learnable prompts are introduced not only at the embedding layer but also in deeper Transformer layers, so that the intermediate textual and visual representations are modeled stage by stage.
- **Cross-Modal Attention Mechanism**: A multi-head attention block computes cross-modal attention, and a linear layer adjusts the dimension of the prompts so that the two modalities can interact.

### Experimental Results

The experiments show that MuDPT outperforms existing uni-modal prompt tuning methods on multiple benchmark datasets, in both few-shot visual recognition and out-of-domain generalization. Specifically:

- On few-shot visual recognition across 11 datasets, the average accuracy of MuDPT is 8.2% higher than that of CoOp and 6.31% higher than that of CoCoOp.
- On out-of-domain generalization, MuDPT also generalizes better, with the gain most pronounced on new categories.

Overall, multi-modal deep-symphysis prompt tuning lets MuDPT overcome the limitations of existing uni-modal prompt tuning methods and improves both accuracy and generalization on visual recognition tasks.
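
As a rough illustration of the cross-modal attention step listed under Specific Improvement Points, the NumPy sketch below shows one way a linear layer and a multi-head attention block could inject prompts from one modality into the other. It is not the authors' implementation (their code is linked in the abstract). The class name `CrossModalInjection`, the residual connection, the random initialization, the prompt length of 4, and the widths 512 (text) and 768 (visual) are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossModalInjection:
    """Hypothetical one-direction prompt injection (illustration only)."""

    def __init__(self, d_target, d_source, n_heads, rng):
        assert d_target % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_target // n_heads
        scale = 0.02  # assumed initialization scale
        # Linear layer mapping source-modality prompts to the target width.
        self.w_align = rng.normal(0.0, scale, (d_source, d_target))
        # Query, key, value, and output projections of the attention block.
        self.w_q = rng.normal(0.0, scale, (d_target, d_target))
        self.w_k = rng.normal(0.0, scale, (d_target, d_target))
        self.w_v = rng.normal(0.0, scale, (d_target, d_target))
        self.w_o = rng.normal(0.0, scale, (d_target, d_target))

    def _split_heads(self, x):
        # (n_prompts, d_target) -> (n_heads, n_prompts, d_head)
        n = x.shape[0]
        return x.reshape(n, self.n_heads, self.d_head).transpose(1, 0, 2)

    def __call__(self, target_prompts, source_prompts):
        # Align the source prompts to the target modality's dimension.
        src = source_prompts @ self.w_align
        # Target prompts act as queries; aligned source prompts as keys/values.
        q = self._split_heads(target_prompts @ self.w_q)
        k = self._split_heads(src @ self.w_k)
        v = self._split_heads(src @ self.w_v)
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head))
        mixed = (attn @ v).transpose(1, 0, 2).reshape(target_prompts.shape)
        # Residual connection (an assumption) keeps the original prompt content.
        return target_prompts + mixed @ self.w_o


rng = np.random.default_rng(0)
text_prompts = rng.normal(size=(4, 512))    # assumed: 4 prompts, text width 512
visual_prompts = rng.normal(size=(4, 768))  # assumed: 4 prompts, visual width 768

# Two injection modules give the bi-directional fusion.
text_to_visual = CrossModalInjection(768, 512, n_heads=8, rng=rng)
visual_to_text = CrossModalInjection(512, 768, n_heads=8, rng=rng)

new_visual = text_to_visual(visual_prompts, text_prompts)
new_text = visual_to_text(text_prompts, visual_prompts)
print(new_text.shape, new_visual.shape)  # (4, 512) (4, 768)
```

In a deep variant of this idea, one such pair of modules would be applied at each of several Transformer layers rather than only at the embedding layer; how many layers to use and whether the modules share weights are design choices this sketch does not settle.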