Abstract: Multimodal content generation, which leverages visual information to enhance cross-modal understanding, plays a critical role in Multimodal Information Retrieval. With the development of large language models (LLMs), recent research has adopted visual instruction tuning to inject the knowledge of LLMs into downstream multimodal tasks. The high complexity and resource demands of these models have motivated research into efficient distillation methods that transfer knowledge from pre-trained multimodal models (teachers) to more compact student models. However, instruction tuning for knowledge distillation in multimodal LLMs is resource-intensive and capability-restricted: the student's comprehension depends heavily on the teacher model. To address this issue, we propose a novel Multimodal Distillation Calibration framework (MmDC). The main idea is to generate high-quality training instances that challenge the student model's comprehension and to prompt the teacher to calibrate the knowledge it transfers to the student, ultimately cultivating a better student model for downstream tasks. The framework comprises two stages: (1) multimodal alignment and (2) knowledge distillation calibration. In the first stage, parameter-efficient fine-tuning is used to enhance feature alignment between different modalities. In the second stage, we develop a calibration strategy that assesses the student model's capability and generates high-quality instances to calibrate knowledge distillation from teacher to student. Experiments on diverse datasets show that our framework efficiently improves the student model's capabilities. Our 7B-size student model, after three iterations of distillation calibration, outperforms the current state-of-the-art LLaVA-13B model on the ScienceQA and LLaVA Test datasets and also exceeds other strong baselines in a zero-shot setting.
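
The iterative assess-generate-retrain cycle described in the second stage can be illustrated with a toy sketch. This is not the MmDC implementation: the teacher is a fixed function, the student is a lookup table, and every name here (`teacher`, `Student`, `assess`, `generate_instances`, `calibrate`, `budget`) is a hypothetical stand-in chosen only to show the shape of the loop, assuming that "assessing the student" means finding inputs where it disagrees with the teacher.

```python
"""Toy sketch of an iterative distillation-calibration loop.

All names are hypothetical; real multimodal models are replaced by
trivial stand-ins so that the control flow is easy to follow.
"""


def teacher(x: int) -> int:
    # Stand-in for a large teacher model: always gives the reference answer.
    return 2 * x


class Student:
    """Stand-in for a compact student: a lookup table that answers 0 when unsure."""

    def __init__(self) -> None:
        self.memory: dict[int, int] = {}

    def answer(self, x: int) -> int:
        return self.memory.get(x, 0)

    def train(self, instances: list[tuple[int, int]]) -> None:
        # "Training" here is just memorizing the teacher-labelled instances.
        for x, y in instances:
            self.memory[x] = y


def assess(student: Student, pool: list[int]) -> list[int]:
    """Return the inputs on which the student disagrees with the teacher."""
    return [x for x in pool if student.answer(x) != teacher(x)]


def generate_instances(hard_inputs: list[int], budget: int) -> list[tuple[int, int]]:
    """Teacher labels up to `budget` hard inputs to form new training instances."""
    return [(x, teacher(x)) for x in hard_inputs[:budget]]


def calibrate(student: Student, pool: list[int],
              iterations: int = 3, budget: int = 2) -> list[int]:
    """Run the loop and record the remaining error count after each iteration."""
    history = []
    for _ in range(iterations):
        hard = assess(student, pool)                      # probe the student
        student.train(generate_instances(hard, budget))   # targeted instances
        history.append(len(assess(student, pool)))
    return history


if __name__ == "__main__":
    print(calibrate(Student(), [1, 2, 3, 4, 5, 6]))
```

In this toy setting each iteration targets only the inputs the student currently gets wrong, so the error count shrinks round by round. The default of three iterations mirrors the number reported in the abstract, while the `budget` of two instances per round is an arbitrary illustrative choice.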

Medical Vision-Language Representation Learning with Cross-Modal Multi-Teacher Contrastive Distillation

Cross-modality Online Distillation for Multi-View Action Recognition

Gradient modulated contrastive distillation of low-rank multi-modal knowledge for disease diagnosis

A Generalization Theory of Cross-Modality Distillation with Contrastive Learning

Multi-Level Contrastive Student-Teacher Structure for Semi-Supervised Medical Image Segmentation

Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality

Multimodal Contrastive Training for Visual Representation Learning

Vision-Language Pre-Training with Triple Contrastive Learning

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Improving Multi-Modal Learning with Uni-Modal Teachers

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Enhanced Multimodal Representation Learning with Cross-modal KD

Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Contrastive Knowledge Distillation for Robust Multimodal Sentiment Analysis

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval