MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

Haofei Yu,Zhengyang Qi,Lawrence Jang,Ruslan Salakhutdinov,Louis-Philippe Morency,Paul Pu Liang
2024-09-26
Abstract: Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE is also able to be applied to various types of models to gain improvement.
Computation and Language
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses the limitations of current multimodal models in handling different types of multimodal interactions. Existing multimodal models mainly focus on the correspondence between images and text, which suits tasks such as image-text matching. They perform poorly on more complex interactions, such as sarcasm expressed through contradictory speech and gestures, or humor conveyed through utterances and tone of voice.

To solve this problem, the authors propose a method called **Multimodal Mixtures of Experts (MMoE)**. The core idea is to train a specialized expert model for each type of multimodal interaction: redundant information (present in both modalities), unique information (present in only one modality), and synergistic information (emerging only when the modalities are fused). In this way, MMoE can better capture and handle different types of multimodal interactions, improving performance on complex tasks.

### Main Contributions

1. **Proposed MMoE Method**: Training multiple specialized expert models, one per interaction type, improves the model's performance on complex tasks.
2. **New SOTA Results**: New state-of-the-art results on two multimodal datasets (MUStARD and URFUNNY).
3. **Broad Applicability**: MMoE can be applied to various types of models, including Vision-Language Models (VLM), Multimodal Large Language Models (MLLM), and Large Language Models for image description (LLM), and shows improvements across these models.

### Experimental Results

- **Overall Comparison**: On the MUStARD and URFUNNY datasets, MMoE outperformed existing state-of-the-art models. On MUStARD, the F1 score increased by 1.35 points; on URFUNNY, the accuracy increased by 0.84 points.
- **Improvements on Different Models**: Applying MMoE to ALBEF, BLIP2, or Qwen2 significantly improved performance. Notably, Qwen2-1.5B achieved a 6.96-point increase in F1 score on MUStARD, becoming the new state-of-the-art model for this task.

### Analysis

- **Limitations of Current Models**: Existing models perform poorly on synergistic information, while their performance on redundant and unique information is relatively better. This indicates that synergy is a major challenge in multimodal interactions.
- **Performance of Expert Models**: Expert models that specifically target synergistic and redundant information showed significant performance improvements on the corresponding types of interaction data, which supports the effectiveness of MMoE.
- **Scale of Expert Models**: The analysis indicates that expert models can be smaller than traditional large multimodal models and still handle specific types of multimodal interactions effectively.

Overall, the paper addresses the limitations of multimodal models in handling complex interactions by introducing MMoE, and reports significant performance improvements across multiple datasets and models.
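
The core idea of MMoE (split the data by interaction type, train one expert per type, then combine the experts) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names `categorize`, `partition`, and `mixture_predict` are hypothetical, the rule that assigns an example to redundancy, uniqueness, or synergy from two unimodal predictions is a plausible guess and not taken from the summary, and the fusion step is a plain weighted average.

```python
INTERACTIONS = ("redundancy", "uniqueness", "synergy")


def categorize(text_pred, image_pred, label):
    """Assign an interaction type to one example (hypothetical rule).

    text_pred and image_pred are predictions from two unimodal models;
    label is the ground-truth multimodal label.
    """
    if text_pred == label and image_pred == label:
        return "redundancy"  # each modality alone already gives the answer
    if text_pred == label or image_pred == label:
        return "uniqueness"  # exactly one modality alone gives the answer
    return "synergy"         # neither modality alone gives the answer


def partition(examples):
    """Group examples by interaction type so each expert gets its own subset."""
    buckets = {name: [] for name in INTERACTIONS}
    for ex in examples:
        kind = categorize(ex["text_pred"], ex["image_pred"], ex["label"])
        buckets[kind].append(ex)
    return buckets


def mixture_predict(expert_probs, weights):
    """Fuse per-expert positive-class probabilities by weighted average.

    Both arguments are dicts keyed by interaction type. The weights are
    assumed to come from some separate gating step that is not shown here.
    """
    total = sum(weights[name] for name in INTERACTIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[n] * expert_probs[n] for n in INTERACTIONS) / total


if __name__ == "__main__":
    toy = [
        {"text_pred": 1, "image_pred": 1, "label": 1},
        {"text_pred": 1, "image_pred": 0, "label": 1},
        {"text_pred": 0, "image_pred": 0, "label": 1},
    ]
    for kind, subset in partition(toy).items():
        print(kind, len(subset))
    fused = mixture_predict(
        {"redundancy": 0.8, "uniqueness": 0.4, "synergy": 0.0},
        {"redundancy": 0.5, "uniqueness": 0.25, "synergy": 0.25},
    )
    print("fused probability:", fused)
```

In this sketch each bucket returned by `partition` would be used to train its own expert, and `mixture_predict` stands in for whatever fusion mechanism combines the experts at inference time.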