Abstract: Although cross-modal image-text retrieval (ITR) equipped with vision-and-language pretraining (VLP) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory makes it hard to optimize over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on hard samples of differing difficulty, which have different effects on distillation learning and student network optimization. We address these challenges in two ways. First, to achieve multi-modal contrastive learning while balancing training cost and effect, we propose using a teacher network to estimate the difficult samples for the student, so that the student absorbs the powerful knowledge of the pre-trained teacher and masters the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which learns from samples of different difficulties so as to better balance the difficulty of the knowledge against the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129× compared to existing ITR models. We further provide in-depth analyses and discussions that explain where the performance improvement comes from. We hope our work can shed light on other tasks that require distillation and contrastive learning.
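
The first idea in the abstract, letting a teacher pick the hard negatives that the student then contrasts against, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy scores, the top-k selection rule, and the InfoNCE-style loss with a temperature of 0.1 are all assumptions made for illustration, and the dynamic weighting of samples by difficulty is not shown.

```python
import math

def select_hard_negatives(teacher_scores, positive_idx, k):
    """Return indices of the k non-positive candidates the teacher scores highest.

    Hypothetical helper: a high teacher score on a non-matching candidate
    is treated here as a sign that the candidate is a hard negative.
    """
    candidates = [i for i in range(len(teacher_scores)) if i != positive_idx]
    candidates.sort(key=lambda i: teacher_scores[i], reverse=True)
    return candidates[:k]

def contrastive_loss(student_scores, positive_idx, negative_idxs, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive over
    the positive plus the selected negatives (assumed loss form)."""
    logits = [student_scores[positive_idx] / temperature]
    logits += [student_scores[i] / temperature for i in negative_idxs]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy similarity scores of one query against five candidates; index 0 is the match.
teacher_scores = [0.9, 0.2, 0.7, 0.1, 0.6]
student_scores = [0.5, 0.1, 0.6, 0.0, 0.3]

hard = select_hard_negatives(teacher_scores, positive_idx=0, k=2)
loss = contrastive_loss(student_scores, 0, hard)
print(hard, loss)
```

Because only k negatives per query reach the student's loss, the number of cross-modal fusion passes the student has to optimize stays small, which is one plausible reading of how the teacher helps with the GPU memory constraint mentioned above.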

ItrievalKD: an Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Dynamic Contrastive Distillation for Image-Text Retrieval

Continual Vision-Language Retrieval Via Dynamic Knowledge Rectification

Knowledge-aware Text-Image Retrieval for Remote Sensing Images

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening

Adaptive CLIP for open-domain 3D model retrieval

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Visual context learning based on textual knowledge for image-text retrieval

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval

A Framework for Image Text Retrieval Based on Large Language Model

CLIP-KD: An Empirical Study of CLIP Model Distillation

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval