MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts

Ruixiang Jiang, Lingbo Liu, Changwen Chen
2024-09-11
Abstract: Despite the demonstrated parameter efficiency of prompt-based multimodal fusion methods, their limited adaptivity and expressiveness often result in suboptimal performance compared to other tuning approaches. In this paper, we address these limitations by decomposing the vanilla prompts to adaptively capture instance-level features. Building upon this decomposition, we introduce the mixture of prompt experts (MoPE) technique to enhance the expressiveness of prompt tuning. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters. We also investigate regularization terms for expert routing, which lead to emergent expert specialization during training, paving the way for interpretable soft prompting. Extensive experiments across six multimodal datasets spanning four modalities demonstrate that our method achieves state-of-the-art results for prompt fusion, matching or even surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code will be released: https://github.com/songrise/MoPE.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited adaptivity and expressiveness of existing prompt-based multimodal fusion methods. Although these methods are parameter-efficient, they often perform worse than other tuning approaches. The paper argues that a conventional long prompt shared globally across all inputs may not be optimal for any individual instance, and that this limitation becomes more pronounced as the amount of data or the complexity of the task grows. To address this, the authors propose MoPE (Mixture of Prompt Experts), which decomposes the global prompt into multiple short, specialized prompt experts. MoPE uses multimodal pairing priors to route the most effective prompt to each instance. Compared to vanilla prompting, this is more expressive and scales more effectively with both the amount of training data and the number of trainable parameters. The authors also study regularization terms for expert routing; these lead to expert specialization emerging during training, which paves the way for interpretable soft prompting. In summary, the paper aims to make prompt-based multimodal fusion more adaptive and expressive so that, across six multimodal datasets, it matches or even surpasses fine-tuning while requiring only 0.8% of the trainable parameters.
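The per-instance routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the softmax-weighted mixture, and all names (`route_prompt`, `expert_keys`, `pair_feature`) are assumptions chosen for illustration. The only idea taken from the text is that a feature from the paired modality decides how several short prompt experts are combined for each instance.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_prompt(pair_feature, expert_keys, experts):
    """Mix prompt experts for one instance (hypothetical routing rule).

    pair_feature: feature vector from the paired modality, length dim
    expert_keys:  one routing key per expert, each of length dim (assumed)
    experts:      num_experts prompts, each prompt_len x dim
    Returns (weights, prompt), where prompt is prompt_len x dim.
    """
    # Score every expert against the paired-modality feature (assumed dot product).
    scores = [sum(f * k for f, k in zip(pair_feature, key)) for key in expert_keys]
    weights = softmax(scores)
    prompt_len, dim = len(experts[0]), len(experts[0][0])
    # The instance-specific prompt is the weighted sum of the expert prompts.
    prompt = [
        [sum(w * e[t][j] for w, e in zip(weights, experts)) for j in range(dim)]
        for t in range(prompt_len)
    ]
    return weights, prompt


# Toy setup: 4 experts, each a prompt of 2 tokens with 3 dimensions.
rng = random.Random(0)
num_experts, prompt_len, dim = 4, 2, 3
experts = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(prompt_len)]
    for _ in range(num_experts)
]
expert_keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
pair_feature = [rng.gauss(0, 1) for _ in range(dim)]

weights, prompt = route_prompt(pair_feature, expert_keys, experts)
```

Different instances yield different `pair_feature` vectors and therefore different mixtures of the same small set of experts, which is what distinguishes this from a single globally shared prompt. In a trained model the experts and keys would be learned parameters rather than random values.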