Abstract: The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token. However, the predictions are based solely on sample features and do not truly reveal the optimization directions of tokens. This may lead to severe optimization interference between different tokens assigned to an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC). Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the **token gradient conflict** problem in Mixture-of-Experts (MoE) layers of Large Vision-Language Models (LVLMs). Existing MoE methods reduce inference cost through sparse activation, but their routing is limited: the router predicts each token's expert from sample features alone, which does not reveal the token's optimization direction. Tokens assigned to the same expert can therefore interfere with one another during optimization.

#### Main problems

1. **Optimization interference**: Existing methods rely only on sample features for routing prediction and fail to identify and handle optimization conflicts between different tokens.
2. **Data interference**: Tokens routed to the same expert may have inconsistent optimization goals, which causes data interference and degrades model performance.

### Proposed solutions

To solve these problems, the authors propose a method based on token-level gradient analysis: **Solving Token Gradient Conflict (STGC)**. Its main steps are:

1. **Conflict token identification**:
   - Compute token-level gradients and identify the tokens whose gradients conflict with the expert's average gradient.
   - Specifically, define the average gradient \(\mathbf{g}_{\text{mean}}\) of all tokens within the expert, and calculate the cosine similarity between each token gradient and this average. When the similarity is lower than a threshold \(\tau\), the token is marked as a conflict token.
2. **Conflict elimination loss**:
   - Introduce a regularization loss term that reduces the probability that a conflict token is routed to its current expert, so that it is reassigned to other experts.
   - By adjusting the routing score \(p_{\text{moe}}(t_n)\), conflict tokens become more likely to be assigned to different experts, which reduces optimization interference.

### Advantages of the method

- **Precise optimization direction**: Token-level gradients reveal the optimization direction directly, avoiding the uncertainty of routing based on sample features alone.
- **Improved expert specialization**: Reducing conflicts encourages each expert to focus on specific types of tokens, improving overall model performance.

### Experimental verification

The reported experiments show that STGC improves the performance of LVLMs on multiple benchmarks, especially on image understanding tasks. The method can also be integrated into existing LVLMs as a plug-in.

In summary, the paper addresses token gradient conflict in MoE architectures by introducing STGC, which the authors report improves the performance and robustness of LVLMs.
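
The two steps summarized under "Proposed solutions" can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names, the toy two-dimensional gradients, and in particular the penalty form `-log(1 - p)` are assumptions chosen for illustration; the summary only states that the loss lowers the probability of routing a conflict token to its current expert, without giving its exact form.

```python
import math
from typing import List, Sequence


def mean_gradient(grads: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token gradients of the tokens routed to one expert."""
    n = len(grads)
    dim = len(grads[0])
    return [sum(g[d] for g in grads) / n for d in range(dim)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_conflict_tokens(grads: Sequence[Sequence[float]], tau: float = 0.0) -> List[int]:
    """Indices of tokens whose gradient has cosine similarity below tau
    with the expert's mean gradient g_mean."""
    g_mean = mean_gradient(grads)
    return [i for i, g in enumerate(grads) if cosine(g, g_mean) < tau]


def conflict_penalty(route_probs: Sequence[float], conflict_idx: Sequence[int]) -> float:
    """Hypothetical regularizer (an assumption, not the paper's loss):
    mean of -log(1 - p) over conflict tokens, where p is the routing
    probability of the token to its current expert. The value shrinks
    as p decreases, so minimizing it pushes conflict tokens elsewhere."""
    if not conflict_idx:
        return 0.0
    eps = 1e-12  # guards against log(0) when p == 1
    total = sum(-math.log(max(1.0 - route_probs[i], eps)) for i in conflict_idx)
    return total / len(conflict_idx)


# Toy example: three tokens routed to the same expert.
grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
probs = [0.8, 0.7, 0.6]  # routing probability of each token to this expert

conflicts = find_conflict_tokens(grads, tau=0.0)
print(conflicts)  # [2]
print(conflict_penalty(probs, conflicts))
```

In this toy case the third token's gradient points against the mean gradient, so it is the only one flagged. In a real model the per-token gradients would come from backpropagation through the expert's parameters, and the penalty would be added to the training loss with a weighting coefficient; both details are outside what the summary specifies.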

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection

Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

Mixture of Diverse Size Experts

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

GRIN: GRadient-INformed MoE

Multi-Head Mixture-of-Experts

HMoE: Heterogeneous Mixture of Experts for Language Modeling

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

MoVA: Adapting Mixture of Vision Experts to Multimodal Context