LLAVADI: What Matters For Multimodal Large Language Models Distillation

Shilin Xu, Xiangtai Li, Haobo Yuan, Lu Qi, Yunhai Tong, Ming-Hsuan Yang

2024-07-28

Abstract: The recent surge in Multimodal Large Language Models (MLLMs) has showcased their remarkable potential for achieving generalized intelligence by integrating visual understanding into Large Language Models. Nevertheless, the sheer model size of MLLMs leads to substantial memory and computational demands that hinder their widespread deployment. In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Instead, we focus on what matters for training small-scale MLLMs through knowledge distillation, which is the first step from the multimodal distillation perspective. Our extensive studies cover training strategies, model choices, and distillation algorithms in the knowledge distillation process. The results show that jointly aligning both tokens and logits plays a critical role in teacher-student frameworks. In addition, we draw a series of intriguing observations from this study. Evaluated on different benchmarks and trained with a proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters. Our code and models will be publicly available for further research.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper primarily focuses on how to train small-scale Multimodal Large Language Models (MLLMs) through knowledge distillation methods. Specifically:

1. **Background and Challenges**:
   - Multimodal Large Language Models (MLLMs) have shown great potential in integrating visual understanding capabilities into large language models.
   - However, the enormous scale of these models leads to high memory and computational demands, limiting their deployment in practical applications.

2. **Research Objectives**:
   - Not to propose new efficient model architectures or train small-scale MLLMs from scratch.
   - Focus on the key factors in training small-scale MLLMs through knowledge distillation methods.
   - Experimentally study different training strategies, model selection, and distillation algorithms.

3. **Core Issues**:
   - Explore which aspects are most critical for training small-scale MLLMs through knowledge distillation.
   - Specifically include feature embedding distillation, logit-level distillation, affinity-aware distillation, and data-driven knowledge distillation.

4. **Contributions**:
   - Propose a framework named LLAVADI, which achieves efficient multimodal large language model distillation by jointly distilling features and logits, combined with teacher-generated data and instruction-tuning data.
   - Demonstrate that simple yet effective logit and feature distillation methods can significantly enhance performance.
   - Validate that adding teacher-generated data and instruction-tuning data can further improve performance.
   - Validate the superiority and efficiency of LLAVADI across multiple benchmarks.

Through this research, the authors aim to find the most effective distillation methods to achieve high-performance small-scale multimodal large language models under resource constraints.
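To make the idea of jointly distilling features and logits concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the KL-divergence form of the logit term, the mean-squared-error form of the feature term, and the weights `alpha` and `beta` are all hypothetical choices. It also assumes teacher and student features share the same dimensionality; in practice a projection layer would likely be needed when they differ.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def logit_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over tokens of KL(teacher || student) on softened distributions.

    Assumption: forward KL is used here for illustration only.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        total += sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))
    return total / len(teacher_logits)


def feature_kd_loss(teacher_feats, student_feats):
    """Mean squared error between per-token feature vectors.

    Assumption: both models expose features of the same dimensionality.
    """
    squared_error = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher_feats, student_feats):
        for t, s in zip(t_vec, s_vec):
            squared_error += (t - s) ** 2
            count += 1
    return squared_error / count


def joint_kd_loss(teacher_logits, student_logits, teacher_feats, student_feats,
                  alpha=1.0, beta=1.0, temperature=1.0):
    """Weighted sum of the logit term and the feature term (hypothetical weights)."""
    logit_term = logit_kd_loss(teacher_logits, student_logits, temperature)
    feature_term = feature_kd_loss(teacher_feats, student_feats)
    return alpha * logit_term + beta * feature_term
```

In a real training loop this joint term would typically be added to the usual next-token prediction loss; how the terms are weighted is a design choice that the summary above does not specify.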

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

MiniLLM: Knowledge Distillation of Large Language Models

Pre-training Distillation for Large Language Models: A Design Space Exploration

Mixed Distillation Helps Smaller Language Model Better Reasoning

Multi-Granularity Semantic Revision for Large Language Model Distillation

BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model

LLMR: Knowledge Distillation with a Large Language Model-Induced Reward

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment