Abstract: As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM). In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks.

What problem does this paper attempt to address?

This paper addresses the uneven ability of existing vision encoders in multimodal large language models (MLLMs) to understand different kinds of image content. Large-scale pre-trained vision encoders such as those in CLIP and DINOv2 perform well on some tasks, but no single encoder dominates across content types: the CLIP vision encoder, for example, gives outstanding results on general image understanding but performs poorly on document or chart content. This bias limits how well MLLMs generalize across diverse tasks.

To address this, the authors first analyze the inherent behavior of different pre-trained vision encoders and then propose MoVA, an MLLM that adaptively routes and fuses task-specific vision experts through a coarse-to-fine mechanism. MoVA works in two stages:

1. **Coarse-grained context-aware expert routing**: A context-aware routing strategy dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert. This step relies on the large language model's (LLM's) ability to understand what each expert model is good at.
2. **Fine-grained expert fusion**: The mixture-of-vision-expert adapter (MoV-Adapter) extracts and fuses task-specific knowledge from the selected experts, drawing on their representations according to the multimodal context and model expertise.

According to the authors, this coarse-to-fine design further improves generalization and yields significant performance gains over current state-of-the-art methods on a wide range of challenging multimodal benchmarks.
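The two stages described above can be pictured with a minimal Python sketch. It is an illustration under strong simplifying assumptions, not the authors' implementation: the expert names, the keyword sets, and the functions `route_experts` and `fuse_features` are hypothetical. Keyword overlap stands in for the LLM-based routing decision, and a softmax-weighted sum stands in for the MoV-Adapter.

```python
import math
import re

# Hypothetical expert descriptions; names and keywords are illustrative only.
EXPERTS = {
    "general": {"photo", "scene", "object"},
    "document": {"document", "text", "ocr"},
    "chart": {"chart", "plot", "axis"},
}


def route_experts(instruction, experts=EXPERTS, top_k=2):
    """Coarse stage stand-in: pick experts whose keywords appear in the instruction.

    The paper describes an LLM making this choice from the instruction, the image,
    and the experts' expertise; keyword overlap is only a placeholder for that.
    """
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    scores = {name: len(words & keywords) for name, keywords in experts.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    selected = [name for name in ranked if scores[name] > 0][:top_k]
    return selected or ["general"]  # assumed fallback when nothing matches


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, gate_logits):
    """Fine stage stand-in: softmax-gated weighted sum of expert feature vectors.

    `features` maps expert name -> feature vector (all of equal length);
    `gate_logits` maps expert name -> unnormalized gate score.
    """
    names = list(features)
    weights = softmax([gate_logits[name] for name in names])
    fused = [0.0] * len(features[names[0]])
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused
```

In this sketch, an instruction such as "Read the text in this document" selects only the hypothetical `document` expert, and the fusion step then blends the selected experts' feature vectors using gate weights that sum to one. A real system would replace both stand-ins with learned components.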

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Multi-modal Intent Detection with LVAMoE: the Language-Visual-Audio Mixture of Experts

EVLM: An Efficient Vision-Language Model for Visual Understanding

MoExtend: Tuning New Experts for Modality and Task Extension

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

MMA: Multi-Modal Adapter for Vision-Language Models

MouSi: Poly-Visual-Expert Vision-Language Models

ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Q-MoE: Connector for MLLMs with Text-Driven Routing

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Dense Connector for MLLMs

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture