Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Size Li, Di Zhang, Xi Li
2024-08-05
Abstract: The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token. However, the predictions are based solely on sample features and do not truly reveal the optimization directions of tokens. This may lead to severe optimization interference between different tokens assigned to an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC). Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the **token gradient conflict** problem in the Mixture-of-Experts (MoE) layers of Large Vision-Language Models (LVLMs). Existing MoE methods reduce inference cost through sparse activation, but their routing is limited: the router predicts each token's expert from sample features alone, which does not reveal the token's optimization direction. As a result, tokens assigned to the same expert may interfere with one another during optimization.

#### Main problems

1. **Optimization interference**: Existing methods rely only on sample features for routing prediction, so they cannot identify or handle optimization conflicts between tokens.
2. **Data interference**: Tokens routed to the same expert may have inconsistent optimization objectives, which causes interference within that expert and degrades model performance.

### Proposed solutions

To solve these problems, the authors propose a method based on token-level gradient analysis, **Solving Token Gradient Conflict (STGC)**. Its main steps are:

1. **Conflicting token identification**:
   - Compute token-level gradients and identify the tokens whose gradients conflict with the expert's average gradient.
   - Specifically, define the average gradient \(\mathbf{g}_{\text{mean}}\) of all tokens within the expert and compute the cosine similarity between each token's gradient and \(\mathbf{g}_{\text{mean}}\). A token whose similarity falls below the threshold \(\tau\) is marked as a conflicting token.
2. **Conflict elimination loss**:
   - Introduce a regularization loss term that lowers the probability of routing a conflicting token to its current expert, so that the token is reassigned to another expert.
   - The loss acts on the routing score \(p_{\text{moe}}(t_n)\), making a conflicting token more likely to be assigned to a different expert and thereby reducing optimization interference.

### Advantages of the method

- **Precise optimization direction**: Token-level gradients reveal the optimization direction directly, avoiding the uncertainty of routing based only on sample features.
- **Improved expert specialization**: Reducing conflicts encourages each expert to focus on specific types of tokens, which improves overall model performance.

### Experimental verification

The experimental results show that STGC improves the performance of LVLMs on multiple benchmarks, especially on image understanding tasks. The method can also be integrated into existing LVLMs as a plug-in.

In summary, the paper addresses token gradient conflict in the MoE architecture by introducing STGC, improving the performance and robustness of LVLMs.
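
As a rough illustration of the two STGC steps described under "Proposed solutions", the following Python sketch marks conflicting tokens by cosine similarity to the expert's mean gradient and then computes a penalty on their routing scores. It is a minimal sketch under stated assumptions, not the authors' implementation. Gradients are plain lists of floats standing in for real per-token gradients, the function names are hypothetical, and the penalty \(-\log(1 - p_{\text{moe}}(t_n))\) is an assumed form chosen only because it decreases as the routing score to the current expert decreases. The summary above does not give the paper's exact loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_gradient(token_grads):
    """Element-wise mean of the token gradients routed to one expert (g_mean)."""
    n = len(token_grads)
    dim = len(token_grads[0])
    return [sum(g[i] for g in token_grads) / n for i in range(dim)]


def find_conflicting_tokens(token_grads, tau):
    """Return a boolean mask: True where cos(g_token, g_mean) < tau."""
    g_mean = mean_gradient(token_grads)
    return [cosine_similarity(g, g_mean) < tau for g in token_grads]


def conflict_elimination_loss(routing_probs, conflict_mask, eps=1e-8):
    """Assumed penalty: mean of -log(1 - p) over conflicting tokens.

    routing_probs[n] is the routing score of token n to its current expert.
    The value shrinks as that score shrinks, so minimizing it pushes
    conflicting tokens toward other experts. Returns 0.0 if no token conflicts.
    """
    terms = [
        -math.log(1.0 - p + eps)
        for p, is_conflict in zip(routing_probs, conflict_mask)
        if is_conflict
    ]
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    # Three toy token gradients for one expert; the third points the other way.
    grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
    mask = find_conflicting_tokens(grads, tau=0.0)
    print(mask)
    print(conflict_elimination_loss([0.8, 0.7, 0.9], mask))
```

In a real MoE layer the per-token gradients would come from backpropagation through the expert's parameters, and the penalty would be added to the training objective with a weighting coefficient. Both details are left out here because the summary does not specify them.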