Abstract: Multimodal content generation, which leverages visual information to enhance the comprehension of cross-modal understanding, plays a critical role in Multimodal Information Retrieval. With the development of large language models (LLMs), recent research has adopted visual instruction tuning to inject the knowledge of LLMs into downstream multimodal tasks. The high complexity and great demand for resources urge researchers to study efficient distillation solutions to transfer the knowledge from pre-trained multimodal models (teachers) to more compact student models. However, the instruction tuning for knowledge distillation in multimodal LLMs is resource-intensive and capability-restricted. The comprehension of students is highly reliant on the teacher models. To address this issue, we propose a novel Multimodal Distillation Calibration framework (MmDC). The main idea is to generate high-quality training instances that challenge student models to comprehend and prompt the teacher to calibrate the knowledge transferred to students, ultimately cultivating a better student model in downstream tasks. This framework comprises two stages: (1) multimodal alignment and (2) knowledge distillation calibration. In the first stage, parameter-efficient fine-tuning is used to enhance feature alignment between different modalities. In the second stage, we develop a calibration strategy to assess the student model's capability and generate high-quality instances to calibrate knowledge distillation from teacher to student. The experiments on diverse datasets show that our framework efficiently improves the student model's capabilities. Our 7B-size student model, after three iterations of distillation calibration, outperforms the current state-of-the-art LLaVA-13B model on the ScienceQA and LLaVA Test datasets and also exceeds other strong baselines in a zero-shot setting.

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

LLAVADI: What Matters For Multimodal Large Language Models Distillation

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

FIRST: Teach A Reliable Large Language Model Through Efficient Trustworthy Distillation

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

Layerwised multimodal knowledge distillation for vision-language pretrained model

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Pre-training Distillation for Large Language Models: A Design Space Exploration

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach

Multi-Granularity Semantic Revision for Large Language Model Distillation

Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model

Reinforced Multi-Teacher Selection for Knowledge Distillation

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

Can a student Large Language Model perform as well as it's teacher?

Module-wise Adaptive Distillation for Multimodality Foundation Models