Abstract: Video-text retrieval has drawn great attention due to the rapid growth of online video content. Most existing methods extract video embeddings by densely sampling many (generally dozens of) video clips, which incurs a tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task in which the text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of its pseudo video class). Since each classification task is defined with different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module to dynamically update our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating that it generalizes better in video-text retrieval. Importantly, we find that our new meta-learning paradigm indeed brings improvements under both cross-dataset and in-dataset retrieval settings.
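
The training step described in the abstract can be illustrated with a minimal sketch. This is not the CMMT implementation: the abstract does not specify how clips are aggregated into a prototype or how queries are scored, so mean pooling, cosine similarity, the softmax temperature, and every function name below are assumptions made for illustration. The token-aware adaptive Transformer module is not modeled.

```python
import math
import random


def sample_global_clips(num_frames, num_clips, clip_len, rng=None):
    """Sparse, global sampling: one clip start per equal-length segment of the video.

    Assumed scheme; the abstract only states that clips span a wide time range.
    """
    rng = rng or random.Random(0)
    seg = num_frames / num_clips
    starts = []
    for i in range(num_clips):
        lo = int(i * seg)
        # Keep the clip inside its segment when the segment is long enough.
        hi = max(lo, min(int((i + 1) * seg), num_frames) - clip_len)
        starts.append(rng.randint(lo, hi))
    return starts


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def prototype(clip_embeddings):
    """Aggregate the clip embeddings of one video (one pseudo class) by mean pooling."""
    n, dim = len(clip_embeddings), len(clip_embeddings[0])
    mean = [sum(e[d] for e in clip_embeddings) / n for d in range(dim)]
    return l2_normalize(mean)


def classify_text(text_embedding, prototypes, temperature=0.05):
    """Softmax over cosine similarities between a text query and all prototypes."""
    q = l2_normalize(text_embedding)
    logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in prototypes]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def episode_loss(text_embeddings, video_clip_embeddings, temperature=0.05):
    """Cross-entropy for one task: text i should be classified as pseudo class i."""
    prototypes = [prototype(clips) for clips in video_clip_embeddings]
    losses = []
    for i, text in enumerate(text_embeddings):
        probs = classify_text(text, prototypes, temperature)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(sample_global_clips(num_frames=300, num_clips=4, clip_len=16))
    videos = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]  # toy clip embeddings
    texts = [[1.0, 0.1], [0.1, 1.0]]  # toy text embeddings, text i matches video i
    print(episode_loss(texts, videos))
```

Each call to `episode_loss` corresponds to one classification task over a fresh batch of videos, which is how the abstract frames a training step as a new task.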

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

TransVOS: Video Object Segmentation with Transformers

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval.

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Context-Aware Hierarchical Transformer for Fine-Grained Video-Text Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Text-Image Cross-modal Retrieval Based on Transformer

Memory-enhanced Hierarchical Transformer for Video Paragraph Captioning

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval.

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

HMTV: hierarchical multimodal transformer for video highlight query on baseball

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

Contrastive Predictive Coding with Transformer for Video Representation Learning

Hierarchical multimodal transformer to summarize videos

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer