MPT4LM: Multi-Modal Prompt Tuning Makes Pre-Trained Large Language Models Better Vision-Language Learners

Yongzhu Miao, Jintao Tang, Shasha Li, Ting Wang
DOI: https://doi.org/10.3233/faia240515
2024-01-01
Abstract: Pre-trained Large Language Models (LLMs) have demonstrated strong generalization across linguistic tasks. However, because of the inherent modality and task discrepancy, parameter-efficient transfer learning for adapting LLMs to vision-language (VL) tasks remains challenging: such adaptation may suffer from excessive extra computation and data expenditure for VL pre-training and from a disconnection between multi-modal representations. This paper concentrates on the parameter-efficient adaptation of LLMs to VL tasks without inflexible multi-modal alignment pre-training on additional image-text pairs. Inspired by Instruction Tuning and the nature of multi-modal representation learning, we propose Multi-modal Prompt Tuning for Language Models (MPT4LM). This method provides text-relevant visual prompts via a plug-and-play Cross-Attention module and integrates them with a textual Learnable Instruction as multi-modal prompts for LLMs. We further combine MPT4LM with the currently prevalent Adapter approach to reduce the trainable parameter scale and facilitate the collaboration of multi-modal prompts. We evaluate MPT4LM on two representative LLMs, LLAMA-2 and Flan-T5, over two VL tasks: Visual Question Answering (VQAv2.0, GQA) and Visual Entailment (SNLI-VE). Extensive experimental results show that MPT4LM achieves state-of-the-art performance among prompting methods while fine-tuning only about 0.65% of the backbone parameters, indicating a better trade-off between computation and data overhead on the one hand and model performance on the other. Our code is available at: https://github.com/YzM1a0/MPT4LM.
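The abstract describes text-relevant visual prompts produced by a Cross-Attention module and combined with a Learnable Instruction. A minimal NumPy sketch of that general idea follows. It is not the authors' implementation: the single-head attention, the function names (`cross_attention`, `build_multimodal_prompt`), the tensor sizes, and the ordering of the concatenated prompt are all assumptions made for illustration, and the random arrays stand in for real text embeddings, image patch features, and trained parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(text, image, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens query image patch features."""
    q = text @ w_q                            # (n_text, d)
    k = image @ w_k                           # (n_patches, d)
    v = image @ w_v                           # (n_patches, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_text, n_patches)
    return softmax(scores, axis=-1) @ v       # (n_text, d)


def build_multimodal_prompt(instruction, text, image, w_q, w_k, w_v):
    # Assumed ordering: [learnable instruction; visual prompts; text tokens].
    visual_prompt = cross_attention(text, image, w_q, w_k, w_v)
    return np.concatenate([instruction, visual_prompt, text], axis=0)


# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d, n_text, n_patches, n_instr = 16, 5, 9, 4
text = rng.normal(size=(n_text, d))            # stand-in for text embeddings
image = rng.normal(size=(n_patches, d))        # stand-in for image patch features
instruction = rng.normal(size=(n_instr, d))    # stand-in for the learnable instruction
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

prompt = build_multimodal_prompt(instruction, text, image, w_q, w_k, w_v)
print(prompt.shape)  # (14, 16): 4 instruction + 5 visual-prompt + 5 text rows
```

In a parameter-efficient setup of this kind, only the instruction vectors, the attention projections, and any adapter weights would be trained while the backbone stays frozen; the sketch covers the forward computation only.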