Multimodal dynamic fusion framework: Multilevel feature fusion guided by prompts

Lei Pan, Huan‐Qing Wu
DOI: https://doi.org/10.1111/exsy.13668
IF: 3.3
2024-07-13
Expert Systems
Abstract: With the progressive augmentation of parameters in multimodal models, to optimize computational efficiency, some studies have adopted the approach of fine‐tuning the unimodal pre‐training model to achieve multimodal fusion tasks. However, these methods tend to rely solely on simplistic or singular fusion strategies, thereby neglecting more flexible fusion approaches. Moreover, existing methods prioritize the integration of modality features containing highly semantic information, often overlooking the influence of fusing low‐level features on the outcomes. Therefore, this study introduces an innovative approach named multilevel feature fusion guided by prompts (MFF‐GP), a multimodal dynamic fusion framework. It guides the dynamic neural network by prompt vectors to dynamically select the suitable fusion network for each hierarchical feature of the unimodal pre‐training model. This method improves the interactions between multiple modalities and promotes a more efficient fusion of features across them. Extensive experiments on the UPMC Food 101, SNLI‐VE and MM‐IMDB datasets demonstrate that with only a few trainable parameters, MFF‐GP achieves significant accuracy improvements compared to a newly designed PMF based on fine‐tuning: specifically, an accuracy improvement of 2.15% on the UPMC Food 101 dataset and 0.82% on the SNLI‐VE dataset. Further study of the results reveals that increasing the diversity of interactions between distinct modalities is critical and delivers significant performance improvements. Furthermore, for certain multimodal tasks, focusing on the low‐level features is beneficial for modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.
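The abstract describes prompt vectors steering a dynamic network that picks a fusion network for each feature level. The following is only an illustrative sketch of that general idea, not the authors' implementation: the candidate operators (`fuse_sum`, `fuse_product`, `fuse_mean`), the dot-product router, the soft (softmax-weighted) selection, and all function names are assumptions introduced here. The abstract does not specify the actual fusion networks or routing rule; see the linked repository for those.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate fusion operators (assumed for illustration only).
def fuse_sum(v, t):
    return [a + b for a, b in zip(v, t)]


def fuse_product(v, t):
    return [a * b for a, b in zip(v, t)]


def fuse_mean(v, t):
    return [(a + b) / 2.0 for a, b in zip(v, t)]


FUSION_OPS = [fuse_sum, fuse_product, fuse_mean]


def route(prompt, router_weights):
    """Turn a prompt vector into one selection weight per fusion operator.

    router_weights holds one row per operator; each score is the dot
    product of that row with the prompt (an assumed routing rule).
    """
    scores = [sum(w * p for w, p in zip(row, prompt)) for row in router_weights]
    return softmax(scores)


def fuse_level(v, t, prompt, router_weights):
    """Fuse one level's image feature v and text feature t.

    Returns the weighted mix of all operator outputs plus the weights.
    """
    weights = route(prompt, router_weights)
    outputs = [op(v, t) for op in FUSION_OPS]
    fused = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(v))
    ]
    return fused, weights


def fuse_all_levels(image_feats, text_feats, prompts, routers):
    """Apply prompt-guided fusion independently at every feature level."""
    return [
        fuse_level(v, t, p, r)
        for v, t, p, r in zip(image_feats, text_feats, prompts, routers)
    ]


if __name__ == "__main__":
    # Toy two-level example: low-level and high-level features.
    image_feats = [[1.0, 2.0], [0.5, 0.5]]
    text_feats = [[3.0, 4.0], [2.0, 1.0]]
    prompts = [[1.0, 0.0], [0.0, 1.0]]
    routers = [
        [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # level 0 leans toward sum
        [[0.0, 0.0], [0.0, 2.0], [0.0, 0.0]],  # level 1 leans toward product
    ]
    for level, (fused, weights) in enumerate(
        fuse_all_levels(image_feats, text_feats, prompts, routers)
    ):
        print(level, fused, weights)
```

Because each level has its own prompt and router, low-level and high-level features can end up with different fusion behavior, which is the property the abstract emphasizes. A hard selection (taking only the highest-weighted operator) would be an equally plausible reading of "select"; the soft mix is used here only because it is simple to write down.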
computer science, artificial intelligence, theory & methods