Abstract: Although vision-and-language pretraining (VLP) has driven remarkable progress in cross-modal image-text retrieval (ITR) over the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework that compresses large VLP models for the ITR task. Technically, we face two challenges: 1) typical uni-modal metric learning approaches are difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on different hard samples, which have different effects on distillation learning and student network optimization. We address these challenges in two ways. First, to achieve multi-modal contrastive learning while balancing training cost and effectiveness, we propose using a teacher network to estimate the difficult samples for the student, so that the student absorbs the powerful knowledge of the pre-trained teacher and masters the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which dynamically learns from samples of different difficulties so as to better balance the difficulty of the knowledge and the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129× compared to existing ITR models. We further provide in-depth analyses and discussions that explain where the performance improvement comes from. We hope our work can shed light on other tasks that require distillation and contrastive learning.
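
The two ideas in the abstract (teacher-estimated hard samples and difficulty-aware distillation) can be illustrated with a minimal sketch. This is not the DCD implementation: the abstract does not give the loss or the weighting rule, so the function names, the top-k selection, the InfoNCE-style loss, and in particular the `dynamic_weight` formula are all assumptions chosen only to make the idea concrete. Scores are plain floats standing in for image-text matching scores.

```python
import math


def select_hard_negatives(teacher_scores, k):
    """Indices of the k negatives the teacher scores highest (assumed 'hardest')."""
    order = sorted(range(len(teacher_scores)),
                   key=lambda i: teacher_scores[i], reverse=True)
    return order[:k]


def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive among positive + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def dynamic_weight(teacher_pos, student_pos):
    """Hypothetical difficulty weight: grows when the student lags the teacher."""
    return 1.0 + max(teacher_pos - student_pos, 0.0)


def student_loss(student_pos, student_negs, teacher_pos, teacher_negs, k=2):
    """Contrast the student only against negatives the teacher marks as hard."""
    hard = select_hard_negatives(teacher_negs, k)
    loss = info_nce(student_pos, [student_negs[i] for i in hard])
    return dynamic_weight(teacher_pos, student_pos) * loss
```

Restricting the student's contrastive loss to a few teacher-selected negatives keeps the number of cross-modal fusion passes small, which is the memory concern the abstract raises; the weighting term is only one of many possible ways to make the loss depend on sample difficulty.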

SPSD: Similarity-preserving self-distillation for video–text retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

TeachText: CrossModal Generalized Distillation for Text-Video Retrieval

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Dynamic Contrastive Distillation for Image-Text Retrieval

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Text-Video Retrieval with Global-Local Semantic Consistent Learning

Video-Language Alignment via Spatio-Temporal Graph Transformer

TSVT: Token Sparsification Vision Transformer for Robust RGB-D Salient Object Detection

Self-Supervised Video Similarity Learning

Video Retrieval with Similarity-Preserving Deep Temporal Hashing

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

SSVMR: Saliency-Based Self-Training for Video-Music Retrieval

UATVR: Uncertainty-Adaptive Text-Video Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval