Abstract: Large vision-language models have achieved outstanding performance, but their size and computational requirements make them impractical to deploy on resource-constrained devices or in time-sensitive tasks. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction toward a solution. This paper investigates the distillation of visual representations from large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles, from the vision and language modality perspectives, to enhance the student's OOD generalization: (1) better imitating the teacher's visual representation space while carefully promoting better coherence in vision-language alignment with the teacher; (2) enriching the teacher's language representations with informative and fine-grained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate these techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Poster: <a class="link-external link-https" href="https://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf" rel="external noopener nofollow">this https URL</a> Code: <a class="link-external link-https" href="https://github.com/xuanlinli17/large_vlm_distillation_ood" rel="external noopener nofollow">this https URL</a>

What problem does this paper attempt to address?

Large vision-language models (such as CLIP, GLIP, and OFA) perform very well, but their size and computational cost make them difficult to deploy on resource-constrained devices or to use in time-sensitive tasks. The paper addresses this by using model distillation to create smaller, faster models that aim to retain the performance of the large models while also generalizing well to open-vocabulary out-of-distribution (OOD) inputs.

Specifically, the research studies how to distill visual representations from large teacher vision-language models into lightweight student models when only a small- or mid-scale dataset is available. It pays particular attention to open-vocabulary OOD generalization, a challenging problem that previous model distillation literature has rarely addressed.

To this end, the authors propose two principles for improving the OOD generalization of student models:

1. **Better Imitation of the Teacher's Visual Representation Space**: The student imitates the teacher's visual feature space more closely while also keeping its vision-language alignment consistent with the teacher's.
2. **Enriching the Teacher's Language Representation**: The teacher's language representations are augmented with more informative, fine-grained semantic attributes so that different labels can be distinguished more effectively.

The authors also propose several metrics and conduct extensive experiments to evaluate these techniques. The results show significant improvements in the zero-shot and few-shot performance of student models on open-vocabulary OOD classification.
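
To make the first principle more concrete, the following is a minimal Python sketch of a distillation objective that combines a feature-imitation term with a vision-language alignment term. It is an illustration under stated assumptions, not the authors' implementation: the function name `distillation_loss`, the choice of a squared L2 imitation term, the cross-entropy alignment term, and the `temperature` and `alpha` parameters are all hypothetical. The text embeddings are assumed to come from the frozen teacher's language encoder (under the second principle, they could be computed from attribute-enriched label descriptions).

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_feat, teacher_feat, text_embeds, label,
                      temperature=0.07, alpha=1.0):
    """Hypothetical loss: feature imitation + vision-language alignment.

    student_feat, teacher_feat: visual features for one image.
    text_embeds: one teacher text embedding per candidate label.
    label: index of the ground-truth label in text_embeds.
    Returns (total, imitation, alignment).
    """
    s = l2_normalize(student_feat)
    t = l2_normalize(teacher_feat)
    texts = [l2_normalize(e) for e in text_embeds]

    # Imitation term: squared L2 distance between normalized features,
    # pulling the student toward the teacher's visual representation.
    imitation = sum((a - b) ** 2 for a, b in zip(s, t))

    # Alignment term: cross-entropy over student-to-text similarities,
    # so the student feature stays close to the correct label's text
    # embedding in the teacher's language space.
    logits = [dot(s, e) / temperature for e in texts]
    alignment = -math.log(softmax(logits)[label])

    return imitation + alpha * alignment, imitation, alignment


if __name__ == "__main__":
    # Toy 2-D example with two candidate labels (made-up numbers).
    total, imitation, alignment = distillation_loss(
        student_feat=[0.9, 0.1],
        teacher_feat=[1.0, 0.0],
        text_embeds=[[1.0, 0.0], [0.0, 1.0]],
        label=0,
    )
    print(total, imitation, alignment)
```

In this sketch `alpha` trades off the two terms. Setting it to zero reduces the objective to pure feature imitation, which ignores the language side entirely.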

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

LLAVADI: What Matters For Multimodal Large Language Models Distillation

Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models

Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Small Language Model Meets with Reinforced Vision Vocabulary

Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning