Abstract: Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature during similarity calculation, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices into an instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7% and +3.8% in relative improvement on these benchmarks, demonstrating the superiority of multi-grained contrast and AOSM.
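
The aggregation idea described in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: similarities are plain dot products, attention weights come from a softmax with a temperature `tau`, and the function names (`attend`, `cross_grained_similarity`, `fine_grained_similarity`) are hypothetical, not taken from the paper or its code.

```python
import math

def softmax(xs, tau=1.0):
    """Numerically stable softmax over a list of floats, with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, tau=0.1):
    """Softmax-weighted sum of scores (assumed form of the attention step).

    High scores receive most of the weight, so low-scoring (less relevant)
    frames or words contribute little to the aggregated value.
    """
    weights = softmax(scores, tau)
    return sum(w * s for w, s in zip(weights, scores))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cross_grained_similarity(coarse, fine_feats, tau=0.1):
    """Compare one coarse feature (e.g. a sentence) with each fine feature
    (e.g. each frame), then aggregate the score vector with attention."""
    scores = [dot(coarse, f) for f in fine_feats]
    return attend(scores, tau)

def fine_grained_similarity(frames, words, tau=0.1):
    """Aggregate a frame-word similarity matrix to a single score.

    Assumption: attention is applied along both axes and the two
    directions are averaged.
    """
    sim = [[dot(f, w) for w in words] for f in frames]
    # Video-to-text: collapse words within each frame, then collapse frames.
    v2t = attend([attend(row, tau) for row in sim], tau)
    # Text-to-video: collapse frames within each word, then collapse words.
    cols = [list(col) for col in zip(*sim)]
    t2v = attend([attend(col, tau) for col in cols], tau)
    return 0.5 * (v2t + t2v)
```

With a small `tau` the aggregated score stays close to the best-matching frame or word, while a large `tau` approaches a plain mean. This is one way to read the abstract's claim that unnecessary frames and words have a reduced effect on the result.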

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Cross-Modal Adapter for Text-Video Retrieval

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

RCAT: Retentive CLIP Adapter Tuning for Improved Video Recognition

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

Learning Text-to-Video Retrieval from Image Captioning

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

Video Editing for Video Retrieval

UATVR: Uncertainty-Adaptive Text-Video Retrieval

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Automatic Speech Recognition Post-Processing for Readability: Task, Dataset and a Two-Stage Pre-Trained Approach