MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts

Ruixiang Jiang, Lingbo Liu, Changwen Chen

2024-09-11

Abstract: Despite the demonstrated parameter efficiency of prompt-based multimodal fusion methods, their limited adaptivity and expressiveness often result in suboptimal performance compared to other tuning approaches. In this paper, we address these limitations by decomposing the vanilla prompts to adaptively capture instance-level features. Building upon this decomposition, we introduce the mixture of prompt experts (MoPE) technique to enhance the expressiveness of prompt tuning. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters. We also investigate regularization terms for expert routing, which lead to emergent expert specialization during training, paving the way for interpretable soft prompting. Extensive experiments across six multimodal datasets spanning four modalities demonstrate that our method achieves state-of-the-art results for prompt fusion, matching or even surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code will be released: https://github.com/songrise/MoPE.

Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limited adaptivity and expressiveness of existing prompt-based multimodal fusion methods. Although these methods are parameter-efficient, they usually perform worse than other tuning approaches. The paper argues that a conventional, globally shared long prompt may not be optimal for every instance, and that this limitation becomes more pronounced as the amount of data or the complexity of the task grows.

To address this, the authors propose MoPE (Mixture of Prompt Experts), which decomposes the global prompt into multiple short, specialized prompt experts. MoPE uses multimodal pairing priors to route the most effective prompt to each instance. This improves expressiveness and lets the method scale more effectively as the amount of training data and the number of trainable parameters increase. The authors also study regularization terms for expert routing; these lead to emergent expert specialization during training, which paves the way for interpretable soft prompts.

In summary, the paper aims to improve the adaptivity and expressiveness of prompt-based multimodal fusion with MoPE, so that it matches or even surpasses fine-tuning on a range of multimodal tasks while requiring only 0.8% of the trainable parameters.
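The per-instance routing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring against per-expert routing keys, and the soft (softmax-weighted) mixture are all assumptions chosen for illustration. The summary above only states that routing is conditioned on the paired modality and selects among short prompt experts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_prompt(pair_feature, expert_keys, expert_prompts):
    """Return (weights, mixed_prompt) for a single instance.

    pair_feature   -- feature vector taken from the paired modality
                      (hypothetical stand-in for the "pairing prior").
    expert_keys    -- one routing key per expert (hypothetical parameterization).
    expert_prompts -- one short prompt per expert; each prompt is a list of
                      token vectors with the same shape across experts.
    """
    # Score each expert against the paired-modality feature.
    scores = [dot(pair_feature, key) for key in expert_keys]
    weights = softmax(scores)

    # Mix the expert prompts token by token using the routing weights.
    n_tokens = len(expert_prompts[0])
    dim = len(expert_prompts[0][0])
    mixed = [
        [sum(w * p[t][d] for w, p in zip(weights, expert_prompts))
         for d in range(dim)]
        for t in range(n_tokens)
    ]
    return weights, mixed


if __name__ == "__main__":
    # Two toy experts, each holding a one-token prompt of dimension 2.
    keys = [[10.0, 0.0], [0.0, 10.0]]
    prompts = [[[1.0, 1.0]], [[-1.0, -1.0]]]
    weights, prompt = route_prompt([1.0, 0.0], keys, prompts)
    print("routing weights:", weights)
    print("instance prompt:", prompt)
```

In this toy setup the paired-modality feature aligns with the first key, so the first expert receives almost all of the routing weight. Because the weights are computed per instance, a different input would produce a different prompt, which is the contrast with a single globally shared prompt.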

Conditional Prompt Tuning for Multimodal Fusion

Modular and Parameter-Efficient Multimodal Fusion with Prompting

EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities

CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks

Efficient Multimodal Fusion Via Interactive Prompting

Multimodal dynamic fusion framework: Multilevel feature fusion guided by prompts

Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Prompt Link Multimodal Fusion in Multimodal Sentiment Analysis

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

A Unified Visual Prompt Tuning Framework with Mixture-of-Experts for Multimodal Information Extraction.

Prompt Fusion Interaction Transformer for Aspect-Based Multimodal Sentiment Analysis

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

M^2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Parameter Efficient Multi-task Fine-tuning by Learning to Transfer Token-wise Prompts