Abstract: Video-text retrieval has drawn great attention due to the proliferation of online video content. Most existing methods extract video embeddings by densely sampling many (generally dozens of) video clips, which incurs a tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task in which the text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of its pseudo video class). Since each classification task is defined with different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module to dynamically update our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating its stronger generalizability. Importantly, we find that our new meta-learning paradigm brings improvements under both cross-dataset and in-dataset retrieval settings.
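The sampling and prototype-classification idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean aggregation of clip embeddings, the cosine similarity, and the temperature value are all assumptions chosen for illustration, and the token-aware adaptive Transformer module is not modeled here.

```python
import math
import random


def sample_clip_starts(num_frames, num_clips, clip_len, rng=random):
    """Sparse, global sampling: one clip start per equal segment of the video.

    Splitting the video into num_clips segments and drawing one clip from
    each makes the sampled clips cover a wide time span (assumed strategy).
    """
    if num_clips <= 0 or clip_len <= 0 or num_frames < num_clips * clip_len:
        raise ValueError("video too short for the requested clips")
    starts = []
    for i in range(num_clips):
        lo = i * num_frames // num_clips
        # Last start index that keeps the clip inside segment i.
        hi = (i + 1) * num_frames // num_clips - clip_len
        starts.append(rng.randint(lo, hi))
    return starts


def mean_prototype(clip_embeddings):
    """Aggregate the clip embeddings of one video into a prototype.

    Mean pooling is an assumption; the abstract only says "aggregated".
    """
    n = len(clip_embeddings)
    return [sum(col) / n for col in zip(*clip_embeddings)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_query(text_emb, prototypes, temperature=0.05):
    """Softmax over similarities between a text query and all prototypes."""
    logits = [cosine(text_emb, p) / temperature for p in prototypes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def episode_loss(text_embs, prototypes, temperature=0.05):
    """Mean cross-entropy for one task, assuming text i describes video i."""
    losses = [
        -math.log(classify_query(t, prototypes, temperature)[i])
        for i, t in enumerate(text_embs)
    ]
    return sum(losses) / len(losses)
```

In this sketch, each training step would draw a fresh set of videos, build one prototype per video from its sampled clips, and minimize `episode_loss` over the paired text queries, so every step poses a new classification task over unseen pseudo video classes.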

Enhanced Cross-Modal Transformer Model for Video Semantic Similarity Measurement

TransVOS: Video Object Segmentation with Transformers

CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval.

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Semantic Enhanced Video Captioning with Multi-feature Fusion

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

CMFF_VS: A Video Summarization Extraction Model Based on Cross-modal Feature Fusion

Cross-modal Semantic Interference Suppression for image-text matching

Semantic association enhancement transformer with relative position for image captioning

Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Cross-modal Token Selection for Video Understanding.

Multimodal attention-based transformer for video captioning

Video-Context Aligned Transformer for Video Question Answering

Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Bridging Asymmetry Between Image and Video: Cross-modality Knowledge Transfer Based on Learning from Video

Interaction augmented transformer with decoupled decoding for video captioning

Transformer Video Classification algorithm based on video token-to-token.

Tencent-MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Video Similarity Evaluation

SVT: Supertoken Video Transformer for Efficient Video Understanding

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos