Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Zichen Wu, Hsiu-Yuan Huang, Fanyi Qu, Yunfang Wu
2024-03-24
Abstract: Deep multi-modal semantic understanding that goes beyond mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% of its parameters (150M), but also significantly outperforms other widely-used prompt methods on VLMs or task-specific methods.
Computation and Language, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper primarily aims to address two key tasks in Multi-modal Semantic Understanding (MSU): Few-shot Multi-modal Sarcasm Detection (MSD) and Few-shot Multi-modal Sentiment Analysis (MSA). Specifically:

1. **Challenges in Multi-modal Data Annotation**:
   - Collecting and annotating high-quality multi-modal data is very difficult, especially in the context of sarcasm expression. Therefore, researchers need to develop models that perform well with a small amount of data.
2. **Cross-modal Fusion Challenges**:
   - Multi-modal data includes both text and image information. Effectively integrating these different modalities to better understand complex semantic relationships is a significant challenge.

To address these issues, the authors propose a new multi-modal soft prompt framework, a Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF) mechanism based on a Unified Vision-Language Model (VLM). This method designs different soft prompt experts to extract unimodal features and introduces a cross-modal prompt attention mechanism, thereby achieving a smooth transition from unimodal representation to multi-modal fusion.

In summary, the paper aims to solve the tasks of multi-modal sarcasm detection and sentiment analysis in few-shot scenarios by proposing a novel multi-modal soft prompt framework, achieving significant performance improvements on multiple benchmark datasets.
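
The idea of cross-modal prompt attention between adjacent blocks can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not specify the attention form, so the sketch assumes single-head scaled dot-product attention with a residual update and no learned projections. All names (`cross_prompt_attention`, `block_boundary_fusion`, `prompt_len`, `dim`, `num_blocks`) are hypothetical, and the Transformer layers inside each block are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_prompt_attention(query_prompt, other_prompt):
    """Let one modality's prompt attend to the other modality's prompt.

    Assumption: single-head scaled dot-product attention with a residual
    connection and no learned projections. Both inputs have shape
    (prompt_len, dim); the output has the shape of query_prompt.
    """
    dim = query_prompt.shape[-1]
    scores = query_prompt @ other_prompt.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return query_prompt + weights @ other_prompt


def block_boundary_fusion(text_prompt, image_prompt):
    """Exchange information between the two prompts at a block boundary."""
    new_text = cross_prompt_attention(text_prompt, image_prompt)
    new_image = cross_prompt_attention(image_prompt, text_prompt)
    return new_text, new_image


# Toy setup (hypothetical sizes): random vectors stand in for learned prompts.
rng = np.random.default_rng(0)
prompt_len, dim, num_blocks = 4, 8, 3
text_prompt = rng.normal(size=(prompt_len, dim))
image_prompt = rng.normal(size=(prompt_len, dim))

# One fusion step between each pair of adjacent blocks.
for _ in range(num_blocks - 1):
    text_prompt, image_prompt = block_boundary_fusion(text_prompt, image_prompt)
```

In this sketch both prompts are updated from their values before the boundary, so neither modality sees the other's already-fused prompt within the same step. A real model would learn the prompts and would likely add learned query, key, and value projections.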