Abstract: Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE can also be applied to various types of models to improve their performance.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper addresses the limitations of current multimodal models in handling different types of multimodal interactions. Existing multimodal models mainly focus on the correspondence between images and text, which suits tasks such as image-text matching. These models perform poorly on more complex multimodal interactions, such as sarcasm expressed through contradictory speech and gestures, or humor conveyed through utterances and tone of voice.

To solve this problem, the authors propose a method called **Multimodal Mixtures of Experts (MMoE)**. The core idea is to train a specialized expert model for each type of multimodal interaction: redundant information (present in both modalities), unique information (present in one modality), and synergistic information (emerging only when both modalities are fused). In this way, MMoE can better capture and handle different types of multimodal interactions, improving performance on complex tasks.

### Main Contributions

1. **Proposed MMoE Method**: Training multiple specialized expert models to handle different types of multimodal interactions improves the model's performance on complex tasks.
2. **New SOTA Results**: Achieved new state-of-the-art results on two multimodal datasets (MUStARD and URFunny).
3. **Broad Applicability**: MMoE can be applied to various types of models, including Vision-Language Models (VLM), Multimodal Large Language Models (MLLM), and Large Language Models for image description (LLM), and shows improvements across these models.

### Experimental Results

- **Overall Comparison**: On the MUStARD and URFunny datasets, MMoE outperformed existing state-of-the-art models. On MUStARD, the F1 score increased by 1.35 points; on URFunny, the accuracy increased by 0.84 points.
- **Improvements on Different Models**: Applying MMoE to ALBEF, BLIP2, or Qwen2 significantly improved performance. Notably, Qwen2-1.5B achieved a 6.96-point increase in F1 score on the MUStARD dataset, becoming the new state-of-the-art model for this task.

### Analysis

- **Limitations of Current Models**: Existing models perform poorly on synergistic information, while their performance on redundant and unique information is relatively better. This indicates that synergy is a major challenge in multimodal interactions.
- **Performance of Expert Models**: Expert models specifically targeting synergistic and redundant information showed significant performance improvements on the corresponding types of interaction data. This supports the effectiveness of the MMoE method.
- **Scale of Expert Models**: The analysis indicates that expert models can be smaller than traditional large multimodal models and still handle specific types of multimodal interactions effectively.

Overall, the paper addresses the limitations of multimodal models in handling complex interactions by introducing the MMoE method, and it reports significant performance improvements on multiple datasets and models.
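
The expert-per-interaction idea summarized above can be illustrated with a small sketch. The summary does not say how examples are assigned to interaction types or how expert outputs are combined, so everything here is an assumption made for illustration: the rule in `interaction_type` (comparing two unimodal predictions with a fused prediction), the `split_by_interaction` helper, and the weighted average in `mmoe_predict` are hypothetical and should not be read as the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# Interaction types named in the summary.
REDUNDANCY, UNIQUENESS, SYNERGY = "redundancy", "uniqueness", "synergy"

# Hypothetical example format: (text_pred, image_pred, fused_pred, payload).
Example = Tuple[int, int, int, str]
# Hypothetical expert interface: payload -> probability of the positive class.
Expert = Callable[[str], float]


def interaction_type(text_pred: int, image_pred: int, fused_pred: int) -> str:
    """Assumed labelling rule (not taken from the source).

    - unimodal predictions disagree          -> uniqueness
    - unimodal agree and match fused         -> redundancy
    - unimodal agree but differ from fused   -> synergy
    """
    if text_pred != image_pred:
        return UNIQUENESS
    if text_pred == fused_pred:
        return REDUNDANCY
    return SYNERGY


def split_by_interaction(data: List[Example]) -> Dict[str, List[str]]:
    """Partition payloads so each expert can be trained on its own subset."""
    subsets: Dict[str, List[str]] = {REDUNDANCY: [], UNIQUENESS: [], SYNERGY: []}
    for text_pred, image_pred, fused_pred, payload in data:
        subsets[interaction_type(text_pred, image_pred, fused_pred)].append(payload)
    return subsets


def mmoe_predict(payload: str,
                 experts: Dict[str, Expert],
                 weights: Dict[str, float]) -> float:
    """Combine expert outputs with a normalized weighted average (assumed fusion)."""
    total = sum(weights[name] for name in experts)
    return sum(weights[name] * experts[name](payload) for name in experts) / total


# Toy usage with constant stand-in experts.
toy_experts: Dict[str, Expert] = {
    REDUNDANCY: lambda _: 0.2,
    UNIQUENESS: lambda _: 0.4,
    SYNERGY: lambda _: 0.9,
}
toy_weights = {REDUNDANCY: 0.1, UNIQUENESS: 0.1, SYNERGY: 0.8}
toy_score = mmoe_predict("utterance + frame", toy_experts, toy_weights)
```

In this sketch the weights are fixed by hand; in practice they would have to come from some estimate of which interaction type a test example exhibits, which the summary does not describe.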

MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

MMOE: Mixture of Multimodal Interaction Experts
