Abstract: Video-text retrieval is a crucial and fundamental task in multi-modal research. Its development has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, that is, the contrast between coarse-grained and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature during similarity calculation, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. A further challenge is the similarity aggregation problem: fine-grained and cross-grained similarity matrices must be aggregated into a single instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module, which makes the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely used video-text retrieval datasets: MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1), and ActivityNet (46.2 R@1). It outperforms the previous state of the art by relative improvements of +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% on these benchmarks, respectively, demonstrating the superiority of multi-grained contrast and AOSM.
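To make the idea of cross-grained contrast and similarity aggregation more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`attention_pool`, `cross_grained_similarity`, `fine_grained_similarity`), the temperature `tau`, the softmax-weighted pooling rule, and the averaging of the two pooling directions are all hypothetical choices made for this example. The abstract only states that AOSM attends over the similarity matrix to down-weight unnecessary frames and words.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, tau):
    """Numerically stable softmax with temperature tau (assumed hyperparameter)."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(scores, tau):
    """Softmax-weighted sum of scores: high scores dominate, low scores are damped."""
    weights = softmax(scores, tau)
    return sum(w * s for w, s in zip(weights, scores))


def cross_grained_similarity(coarse, fine_feats, tau=0.1):
    """Compare one coarse feature (e.g. a sentence) with each fine feature (e.g. frames)."""
    scores = [dot(coarse, f) for f in fine_feats]
    return attention_pool(scores, tau)


def fine_grained_similarity(frames, words, tau=0.1):
    """Aggregate a frame-word similarity matrix into one instance-level score."""
    matrix = [[dot(f, w) for w in words] for f in frames]
    # Pool over words for each frame, then over frames.
    per_frame = [attention_pool(row, tau) for row in matrix]
    video_to_text = attention_pool(per_frame, tau)
    # Pool over frames for each word, then over words.
    per_word = [attention_pool(list(col), tau) for col in zip(*matrix)]
    text_to_video = attention_pool(per_word, tau)
    # Averaging the two directions is an assumption of this sketch.
    return 0.5 * (video_to_text + text_to_video)


# Toy unit-norm features: one sentence vector and three frame vectors.
sentence = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
pooled = cross_grained_similarity(sentence, frames, tau=0.1)
mean_pooled = sum(dot(sentence, f) for f in frames) / len(frames)
```

In this toy case the attention-pooled score stays close to the best-matching frame, whereas plain mean pooling is dragged down by the irrelevant frame. With a very large `tau` the weights become nearly uniform and the pooled score approaches the mean.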

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP4Caption: CLIP for Video Caption

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

How Much Can CLIP Benefit Vision-and-Language Tasks?

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Contrastive Localized Language-Image Pre-Training

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring