Abstract: We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from a large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of the s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between the output distributions of the two models so that the student emulates the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating the l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples improves significantly beyond that of the l-MLLM, yielding a student that surpasses its teacher, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available at https://github.com/shufangxun/LLaVA-MoD.
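
The two distillation objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the KL direction (teacher relative to student), the value of `beta`, and the use of plain Python lists in place of model tensors are all assumptions made for illustration. The second function is the standard DPO loss, with the teacher's log-probabilities supplied as the reference terms, as the abstract describes.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution.

    The direction is an assumption; the abstract only says the KL divergence
    between output distributions is minimized.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dpo_loss(student_logp_chosen, student_logp_rejected,
             teacher_logp_chosen, teacher_logp_rejected, beta=0.1):
    """Standard DPO loss with the teacher acting as the frozen reference.

    Each argument is the summed log-probability of a full response.
    beta=0.1 is a hypothetical default, not a value from the paper.
    """
    margin = beta * ((student_logp_chosen - teacher_logp_chosen)
                     - (student_logp_rejected - teacher_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the student matches the teacher exactly, the KL term is zero and the DPO loss equals log 2. The DPO loss falls below log 2 only when the student separates the preferred response from the rejected one more strongly than the teacher does, which is one way to read the abstract's claim that the student can end up discriminating better than its teacher.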

VinaLLaMA: LLaMA-based Vietnamese Foundation Model

LaVy: Vietnamese Multimodal Large Language Model

ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Efficient Finetuning Large Language Models For Vietnamese Chatbot

Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models

SeaLLMs -- Large Language Models for Southeast Asia

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Tamil-Llama: A New Tamil Language Model Based on Llama 2

ViDeBERTa: A powerful pre-trained language model for Vietnamese

Vietnamese AI Generated Text Detection

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Evaluating Large Language Model Capability in Vietnamese Fact-Checking Data Generation

ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

Yo'LLaVA: Your Personalized Language and Vision Assistant

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model