Abstract: We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from a large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of the s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions so that the student model emulates the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating the l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond that of the l-MLLM, leading to a better student that surpasses its teacher, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available at: https://github.com/shufangxun/LLaVA-MoD.
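The two distillation objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the KL direction (teacher to student), the per-token scalar inputs, and the value `beta=0.1` are all assumptions made for illustration. The DPO term is the standard DPO loss, with the teacher's log-probabilities taking the place of the reference model, as the abstract describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one output distribution.

    The direction is an assumption; the abstract only says the KL
    divergence between output distributions is minimized.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)


def dpo_loss(student_logp_chosen, student_logp_rejected,
             teacher_logp_chosen, teacher_logp_rejected, beta=0.1):
    """Standard DPO loss with the teacher used as the reference model.

    Inputs are sequence log-probabilities of the preferred (chosen) and
    dispreferred (rejected) responses. beta=0.1 is an assumed value.
    """
    margin = beta * ((student_logp_chosen - teacher_logp_chosen)
                     - (student_logp_rejected - teacher_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the mimic term is zero when the student reproduces the teacher's distribution exactly. The preference term falls below log 2 only when the student separates the chosen and rejected responses more strongly than the teacher does, which is one way to read the abstract's claim that the student can end up surpassing its teacher.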

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

HMoE: Heterogeneous Mixture of Experts for Language Modeling

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

MoExtend: Tuning New Experts for Modality and Task Extension

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

WDMoE: Wireless Distributed Large Language Models with Mixture of Experts

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach

A Closer Look into Mixture-of-Experts in Large Language Models

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation