Abstract: As multimodal models grow in parameter count, some studies improve computational efficiency by fine-tuning unimodal pre-trained models to perform multimodal fusion tasks. However, these methods tend to rely on a single, simple fusion strategy and neglect more flexible fusion approaches. Moreover, existing methods prioritize fusing modality features that carry high-level semantic information, often overlooking how fusing low-level features affects the results. This study therefore introduces multilevel feature fusion guided by prompts (MFF-GP), a multimodal dynamic fusion framework. Prompt vectors guide a dynamic neural network to select a suitable fusion network for each hierarchical feature of the unimodal pre-trained models. This improves the interactions between modalities and promotes more efficient feature fusion across them. Extensive experiments on the UPMC Food 101, SNLI-VE and MM-IMDB datasets demonstrate that, with only a few trainable parameters, MFF-GP achieves significant accuracy improvements over the recently proposed fine-tuning-based PMF: 2.15% on UPMC Food 101 and 0.82% on SNLI-VE. Further analysis of the results reveals that increasing the diversity of interactions between modalities is critical and delivers significant performance improvements. Furthermore, for certain multimodal tasks, attending to low-level features benefits modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.
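
The idea of letting a prompt vector choose a fusion network at each feature level can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The abstract does not specify the gating mechanism, so the linear gate, the hard argmax selection, the three toy fusion operators, and every name below (`route`, `fuse_level`, `FUSION_OPS`, and so on) are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate fusion operators (stand-ins for real fusion networks).
def fuse_add(a, b):
    return [x + y for x, y in zip(a, b)]

def fuse_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def fuse_max(a, b):
    return [max(x, y) for x, y in zip(a, b)]

FUSION_OPS = [fuse_add, fuse_mul, fuse_max]

def route(prompt, gate_weights):
    """Score each fusion operator from the prompt vector (assumed linear gate + softmax)."""
    scores = [sum(w * p for w, p in zip(row, prompt)) for row in gate_weights]
    return softmax(scores)

def fuse_level(img_feat, txt_feat, prompt, gate_weights):
    """Fuse one feature level with the operator the prompt scores highest (assumed hard selection)."""
    probs = route(prompt, gate_weights)
    best = max(range(len(probs)), key=probs.__getitem__)
    return FUSION_OPS[best](img_feat, txt_feat), best

def fuse_all_levels(img_levels, txt_levels, prompts, gates):
    """Apply prompt-guided selection independently at every level, low to high."""
    return [
        fuse_level(i, t, p, g)
        for i, t, p, g in zip(img_levels, txt_levels, prompts, gates)
    ]
```

Each level has its own prompt and gate here, so a low-level feature pair and a high-level one can be routed to different operators, which is the behaviour the abstract describes at a high level. A trainable version would need a differentiable relaxation of the argmax; that choice is left open.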

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Conditional Prompt Tuning for Multimodal Fusion

Prompt Tuning for Unified Multimodal Pretrained Models

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

Multimodal Infusion Tuning for Large Models

Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Towards a Unified View of Parameter-Efficient Transfer Learning

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Efficient Multimodal Fusion Via Interactive Prompting

Prompt Tuning for Generative Multimodal Pretrained Models

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Parameter Efficient Multi-task Fine-tuning by Learning to Transfer Token-wise Prompts

Parameter-efficient Weight Ensembling Facilitates Task-level Knowledge Transfer

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

Multimodal dynamic fusion framework: Multilevel feature fusion guided by prompts

Multimodal Instruction Tuning with Hybrid State Space Models

On Transferability of Prompt Tuning for Natural Language Processing

APrompt: Attention Prompt Tuning for Efficient Adaptation of Pre-trained Language Models

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models