LLAVADI: What Matters For Multimodal Large Language Models Distillation

Shilin Xu, Xiangtai Li, Haobo Yuan, Lu Qi, Yunhai Tong, Ming-Hsuan Yang
2024-07-28
Abstract: The recent surge in Multimodal Large Language Models (MLLMs) has showcased their remarkable potential for achieving generalized intelligence by integrating visual understanding into Large Language Models. Nevertheless, the sheer model size of MLLMs leads to substantial memory and computational demands that hinder their widespread deployment. In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Instead, we focus on what matters for training small-scale MLLMs through knowledge distillation, which is the first step from the multimodal distillation perspective. Our extensive studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process. These results show that joint alignment for both tokens and logit alignment plays critical roles in teacher-student frameworks. In addition, we draw a series of intriguing observations from this study. By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters. Our code and models will be publicly available for further research.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper primarily focuses on how to train small-scale Multimodal Large Language Models (MLLMs) through knowledge distillation. Specifically:

1. **Background and Challenges**:
   - Multimodal Large Language Models (MLLMs) have shown great potential in integrating visual understanding capabilities into large language models.
   - However, the enormous scale of these models leads to high memory and computational demands, limiting their deployment in practical applications.
2. **Research Objectives**:
   - The paper does not propose new efficient model architectures or train small-scale MLLMs from scratch.
   - It focuses on the key factors in training small-scale MLLMs through knowledge distillation.
   - It experimentally studies different training strategies, model choices, and distillation algorithms.
3. **Core Issues**:
   - Explore which aspects are most critical for training small-scale MLLMs through knowledge distillation.
   - The aspects examined include feature embedding distillation, logit-level distillation, affinity-aware distillation, and data-driven knowledge distillation.
4. **Contributions**:
   - Propose a framework named LLAVADI, which distills multimodal large language models by jointly distilling features and logits, combined with teacher-generated data and instruction-tuning data.
   - Demonstrate that simple yet effective logit and feature distillation methods can significantly enhance performance.
   - Validate that adding teacher-generated data and instruction-tuning data can further improve performance.
   - Validate the superiority and efficiency of LLAVADI across multiple benchmarks.

Through this research, the authors aim to find the most effective distillation methods for obtaining high-performance small-scale multimodal large language models under resource constraints.
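To make the idea of jointly distilling logits and features more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the choice of KL divergence for the logit term, mean squared error for the feature term, the weights `alpha` and `beta`, the `temperature` parameter, and all function names are illustrative assumptions. It also assumes the student features have already been projected to the teacher's dimension.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one token's vocabulary distribution.

    Assumption: KL divergence is used as the logit-alignment term.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def feature_mse_loss(student_feat, teacher_feat):
    """Mean squared error between two equally sized feature vectors.

    Assumption: the student features were already projected to the
    teacher's hidden size, so the lengths match.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def joint_distill_loss(student_logits, teacher_logits,
                       student_feat, teacher_feat,
                       alpha=1.0, beta=1.0, temperature=1.0):
    """Weighted sum of the logit term and the feature term.

    alpha and beta are hypothetical weights, not values from the paper.
    """
    kd = logit_kd_loss(student_logits, teacher_logits, temperature)
    feat = feature_mse_loss(student_feat, teacher_feat)
    return alpha * kd + beta * feat


if __name__ == "__main__":
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.5, 0.7, -0.5]
    teacher_feat = [0.2, -0.1, 0.4]
    student_feat = [0.1, 0.0, 0.3]
    print(joint_distill_loss(student_logits, teacher_logits,
                             student_feat, teacher_feat))
```

In a real training loop the same two terms would be computed over batches of token positions with a tensor library and added to the usual language-modeling loss; the scalar version here only shows how the two alignment signals combine.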