Abstract: Although cross-modal image-text retrieval (ITR) equipped with vision-and-language pretraining (VLP) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) typical uni-modal metric learning approaches are difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on hard samples of differing difficulty, which have different effects on distillation learning and student network optimization. We address these challenges in two ways. First, to achieve multi-modal contrastive learning while balancing training cost and effectiveness, we propose using a teacher network to estimate the difficult samples for the student, so that the student absorbs the powerful knowledge of the pre-trained teacher and masters the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which learns from samples of different difficulties so as to better balance the difficulty of the knowledge against the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129× compared to existing ITR models. We further provide in-depth analyses and discussions that explain where the performance improvement comes from. We hope our work can shed light on other tasks that require distillation and contrastive learning.
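
The first idea in the abstract, letting a teacher pick the hard negatives that the student then contrasts against, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_hard_negatives` and `contrastive_loss`, the top-k selection rule, the InfoNCE-style loss, and the `temperature` parameter are all assumptions made for illustration, and the similarity scores are plain Python lists rather than model outputs. The dynamic weighting of samples by difficulty is not sketched, since the abstract does not specify its form.

```python
import math


def select_hard_negatives(teacher_scores, positive_index, k):
    """Return the indices of the k non-positive candidates that the
    teacher scores highest (assumed here to be the hardest negatives)."""
    candidates = [i for i in range(len(teacher_scores)) if i != positive_index]
    candidates.sort(key=lambda i: teacher_scores[i], reverse=True)
    return candidates[:k]


def contrastive_loss(student_scores, positive_index, negative_indices,
                     temperature=1.0):
    """InfoNCE-style loss (an assumption, not the paper's exact loss):
    negative log-softmax of the positive score over the positive plus
    the selected negatives, computed with a stable log-sum-exp."""
    logits = [student_scores[positive_index] / temperature]
    logits += [student_scores[i] / temperature for i in negative_indices]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical similarity scores of one query against five candidates;
# candidate 0 is the matching (positive) item.
teacher_scores = [0.9, 0.1, 0.8, 0.3, 0.7]
student_scores = [2.0, 0.0, 1.0, 0.0, 1.0]

hard_negatives = select_hard_negatives(teacher_scores, positive_index=0, k=2)
loss = contrastive_loss(student_scores, 0, hard_negatives)
```

Because only the k teacher-selected negatives enter the student's loss, the expensive cross-modal scores need to be computed by the student for k + 1 candidates per query rather than for every candidate, which is the memory saving the abstract alludes to.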

CDHD: Contrastive Dreamer for Hint Distillation

DCCD: Reducing Neural Network Redundancy Via Distillation

DCD: Discriminative and Consistent Representation Distillation

Hybrid mix-up contrastive knowledge distillation

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings

Class Incremental Learning with Multi-Teacher Distillation

Inherit With Distillation and Evolve With Contrast: Exploring Class Incremental Semantic Segmentation Without Exemplar Memory

Pixel-Wise Contrastive Distillation

Class Incremental Learning with Deep Contrastive Learning and Attention Distillation

Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation

Dynamic Contrastive Distillation for Image-Text Retrieval

CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective

Class incremental learning of remote sensing images based on class similarity distillation

Multi-teacher Contrastive Knowledge Inversion for Data-Free Distillation

Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Double Confidence Calibration Focused Distillation for Task-Incremental Learning

Hybrid Memory Replay: Blending Real and Distilled Data for Class Incremental Learning

Few-Shot Class-Incremental Learning Via Class-Aware Bilateral Distillation

Less confidence, less forgetting: Learning with a humbler teacher in exemplar-free Class-Incremental learning

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination