Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.

What problem does this paper attempt to address?

The main problem this paper addresses is how to compress large multimodal models (such as CLIP and SigLIP) into smaller models with higher inference efficiency without losing performance on downstream tasks. Specifically, the authors propose a new method, active data curation, which achieves this goal by selectively choosing training data.

### Problem Background

Deploying large multimodal foundation models (such as CLIP) on edge devices is challenging because these models have high inference costs and memory footprints. It is therefore necessary to compress them into smaller models that can be deployed efficiently while maintaining performance comparable to that of the large models. Traditional knowledge distillation (KD) techniques achieve this by matching the outputs (such as logits, features, or activations) of the student model to those of the teacher model.

### Solution Proposed in the Paper

The authors propose a method named ACID (Active Curation as Implicit Distillation). Using an online batch-selection strategy, it automatically selects the samples that reduce the performance gap between the small model (student) and the large frozen model (reference). This method not only simplifies the model-compression process but also outperforms traditional KD in multiple experiments.

### Main Contributions

1. **ACID Method**: A new data-selection strategy that achieves effective implicit distillation by actively curating data.
2. **ACED Framework**: By combining explicit distillation with ACID, the authors develop a simple and efficient pretraining framework named ACED (ACID with Explicit Distillation). This framework achieves state-of-the-art results on multiple zero-shot classification and image-text retrieval tasks while reducing the FLOPs required for inference.
3. **Theoretical and Experimental Evidence**: Theoretical analysis and extensive experiments demonstrate the effectiveness and scalability of the ACID method, especially across reference models of different scales.

### Experimental Results

The experimental results show that ACID and ACED significantly outperform existing KD methods and other compression techniques on multiple tasks. In particular, ACED achieves state-of-the-art results across 27 zero-shot classification and image-text retrieval tasks while reducing inference FLOPs by up to 11%. Moreover, the vision encoders produced by ACED also perform well on image-captioning and visual question-answering tasks, outperforming larger and more resource-intensive vision encoders.

### Summary

By introducing the ACID and ACED methods, this paper provides a novel and effective multimodal model-compression scheme that addresses the problem of efficiently deploying large multimodal models on edge devices.
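The online batch-selection idea behind ACID can be illustrated with a minimal sketch. The summary above does not state the exact scoring rule, so the code assumes one plausible choice: score each candidate example by the student's loss minus the frozen reference model's loss, and keep the highest-scoring examples from a larger candidate batch. The function name `select_batch` and the loss values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def select_batch(student_losses, reference_losses, keep):
    """Pick `keep` example indices from a candidate batch.

    Assumption (not stated in the summary above): an example is scored as
    student loss minus reference loss, so examples the student still gets
    wrong but the frozen reference handles well are ranked first.
    """
    if len(student_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    if not 0 <= keep <= len(student_losses):
        raise ValueError("keep must be between 0 and the batch size")

    # Per-example score: large when the student lags behind the reference.
    scores = [s - r for s, r in zip(student_losses, reference_losses)]

    # Rank candidate indices by score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-example losses for a candidate batch of four examples.
student = [2.0, 0.3, 1.5, 0.9]
reference = [0.5, 0.2, 1.4, 0.1]
selected = select_batch(student, reference, keep=2)
print(selected)  # prints [0, 3]
```

In a training loop under these assumptions, the student would take its gradient step only on the selected examples, and the reference model would be used purely for scoring. Combining this selection with an explicit distillation loss on the selected examples would correspond to the ACED setup described above.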

Active Data Curation Effectively Distills Large-Scale Multimodal Models

DCCD: Reducing Neural Network Redundancy Via Distillation

CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing

Up to 100x Faster Data-Free Knowledge Distillation

Data-Free Adversarial Distillation

Data curation via joint example selection further accelerates multimodal learning

Adaptive Cross-Architecture Mutual Knowledge Distillation

Comparative Knowledge Distillation

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings

AdaDS: Adaptive Data Selection for Accelerating Pre-Trained Language Model Knowledge Distillation

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

AMD: Automatic Multi-step Distillation of Large-scale Vision Models

CLIP-KD: An Empirical Study of CLIP Model Distillation

Deep Collective Knowledge Distillation

CiT: Curation in Training for Effective Vision-Language Data

DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

Data Efficient Stagewise Knowledge Distillation

Online Knowledge Distillation via Collaborative Learning

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding