Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Zichen Wu, Hsiu-Yuan Huang, Fanyi Qu, Yunfang Wu

2024-03-24

Abstract: Deep multi-modal semantic understanding that goes beyond mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks in this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% of the parameters (150M), but also significantly outperforms other widely used prompt methods on VLMs and task-specific methods.

Computation and Language, Multimedia

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper primarily aims to address two key tasks in Multi-modal Semantic Understanding (MSU): Few-shot Multi-modal Sarcasm Detection (MSD) and Few-shot Multi-modal Sentiment Analysis (MSA). Specifically:

1. **Challenges in Multi-modal Data Annotation**:
   - Collecting and annotating high-quality multi-modal data is very difficult, especially in the context of sarcasm expression. Therefore, researchers need to develop models that perform well with a small amount of data.
2. **Cross-modal Fusion Challenges**:
   - Multi-modal data includes both text and image information. Effectively integrating these different modalities to better understand complex semantic relationships is a significant challenge.

To address these issues, the authors propose a new multi-modal soft prompt framework: Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), built on a Unified Vision-Language Model (VLM). This method designs different soft prompt experts to extract unimodal features and introduces a cross-modal prompt attention mechanism, thereby achieving a smooth transition from unimodal representation to multi-modal fusion.

In summary, the paper aims to solve the tasks of multi-modal sarcasm detection and sentiment analysis in few-shot scenarios by proposing a novel multi-modal soft prompt framework, achieving significant performance improvements on multiple benchmark datasets.
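The idea of prompt experts with cross-modal prompt attention between adjacent blocks can be pictured with a toy sketch. This is not the authors' implementation: the prompt length, the hidden size, the equal-sized block split, the projection-free attention, and the residual update are all assumptions made for illustration, and every function name here is hypothetical. The frozen Transformer layers are left out, and the unified prompt is only initialised, since the summary does not specify how it is used.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention where memory serves as both keys and values.

    Assumption: no learned projections, to keep the sketch short.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
        )
    return outputs


def cross_modal_prompt_fusion(text_prompt, image_prompt):
    """Each modality's prompt attends to the other's, then adds the result residually."""
    text_ctx = attend(text_prompt, image_prompt)
    image_ctx = attend(image_prompt, text_prompt)
    new_text = [[p + c for p, c in zip(vec, ctx)] for vec, ctx in zip(text_prompt, text_ctx)]
    new_image = [[p + c for p, c in zip(vec, ctx)] for vec, ctx in zip(image_prompt, image_ctx)]
    return new_text, new_image


def init_prompt(length, dim, rng):
    """A soft prompt is a short sequence of trainable vectors (random here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]


def split_into_blocks(num_layers, num_blocks):
    """Partition layer indices into equal contiguous blocks (assumed split rule)."""
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks in this sketch")
    size = num_layers // num_blocks
    return [list(range(start, start + size)) for start in range(0, num_layers, size)]


def run_sketch(num_layers=12, num_blocks=3, prompt_len=4, dim=8, seed=0):
    rng = random.Random(seed)
    prompts = {
        "text": init_prompt(prompt_len, dim, rng),
        "image": init_prompt(prompt_len, dim, rng),
        "unified": init_prompt(prompt_len, dim, rng),  # only initialised in this sketch
    }
    blocks = split_into_blocks(num_layers, num_blocks)
    for index in range(len(blocks)):
        # The frozen Transformer layers of blocks[index] would run here,
        # with the prompts prepended to the token sequences.
        if index < len(blocks) - 1:
            # Exchange information between modality prompts at the block boundary.
            prompts["text"], prompts["image"] = cross_modal_prompt_fusion(
                prompts["text"], prompts["image"]
            )
    return prompts, blocks
```

Calling `run_sketch()` returns the updated prompts and the block layout. In a real prompt-tuning setup the prompt vectors would be the trainable parameters while the backbone stays frozen, which is one way the small trainable-parameter count in few-shot settings can be understood.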
