Abstract: Although cross-modal image-text retrieval (ITR) equipped with vision-and-language pretraining (VLP) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to the cross-modal task, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network from different hard samples, which have different effects on distillation learning and student network optimization. We address these challenges in two ways. First, to achieve multi-modal contrastive learning while balancing training cost and effectiveness, we propose to use a teacher network to estimate the difficult samples for students, so that students absorb the powerful knowledge of pre-trained teachers and master the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which learns from samples of different difficulties so as to better balance the difficulty of the knowledge against the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129× compared to existing ITR models. We further provide in-depth analyses and discussions that explain where the performance improvement comes from.
We hope our work can shed light on other tasks that require distillation and contrastive learning.
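
The first idea in the abstract, letting a teacher pick the difficult samples on which the student is trained contrastively, can be illustrated with a minimal sketch. This is not the DCD implementation: the function names `select_hard_negatives` and `contrastive_loss`, the InfoNCE-style loss, the temperature, and the toy scores are all assumptions made for illustration, and the dynamic weighting of samples by difficulty is not covered.

```python
import math


def select_hard_negatives(teacher_scores, positive_index, k):
    """Return the k non-positive candidates that the teacher scores highest.

    Hypothetical helper: a high teacher score on a non-matching candidate is
    treated here as a sign that the candidate is a hard negative.
    """
    candidates = [i for i in range(len(teacher_scores)) if i != positive_index]
    candidates.sort(key=lambda i: teacher_scores[i], reverse=True)
    return candidates[:k]


def contrastive_loss(student_scores, positive_index, negative_indices, temperature=1.0):
    """InfoNCE-style loss: negative log-softmax of the positive over the selected set."""
    logits = [student_scores[positive_index] / temperature]
    logits += [student_scores[i] / temperature for i in negative_indices]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Toy matching scores for one query against five candidates (index 0 is the true match).
teacher_scores = [0.9, 0.8, 0.1, 0.7, 0.2]
student_scores = [2.0, 1.5, -1.0, 1.8, -0.5]

hard = select_hard_negatives(teacher_scores, positive_index=0, k=2)
hard_loss = contrastive_loss(student_scores, 0, hard)
easy_loss = contrastive_loss(student_scores, 0, [2, 4])
```

In this toy setting the loss over the teacher-selected negatives is larger than the loss over the two easiest negatives, which is the intuition behind spending a limited negative budget on hard samples rather than on all candidates.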

A Lightweight and Effective Multi-View Knowledge Distillation Framework for Text-Image Retrieval

Towards Better Entity Linking with Multi-View Enhanced Distillation

VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval

A Fast and Accurate Method for Remote Sensing Image-Text Retrieval Based on Large Model Knowledge Distillation

Dynamic Contrastive Distillation for Image-Text Retrieval

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval

Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation

Distilled Dual-Encoder Model for Vision-Language Understanding

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

DLIP: Distilling Language-Image Pre-training

Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation

Online Knowledge Distillation Via Mutual Contrastive Learning for Visual Recognition

RSKD: Enhanced medical image segmentation via multi-layer, rank-sensitive knowledge distillation in Vision Transformer models

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Multi-target Knowledge Distillation Via Student Self-reflection