Abstract: Videos are generally accompanied by multi-modal information such as audio, text, and motion. This multi-modal information is becoming an important cue for understanding video content. How to model the correlation between modalities in videos remains an unsolved problem in video understanding tasks such as video action recognition, video temporal grounding, and video description. In this talk, we focus on two specific video understanding tasks (i.e., cross-modal self-supervised pretraining and temporal grounding) by exploiting video-text cross-modal information. In particular, we notice that videos are naturally accompanied by abundant text information such as YouTube titles, Instagram captions, and movie scripts. This textual information could serve as a general supervisory signal to guide the training of a multi-modal network, which could be used as a general video representation to be fine-tuned on downstream tasks, or as a cross-modal matching similarity for video segment retrieval. Specifically, we first present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51. Our CPD demonstrates that pre-training on a relatively small dataset can yield performance comparable to methods that use an order of magnitude more data, which is meaningful and practical for scenarios with limited computational resources.
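To make the pair discrimination idea concrete, the following is a minimal sketch of one common way to score matched video-text pairs against mismatched ones within a batch: a symmetric InfoNCE-style loss. It is an illustration only and is not claimed to be the exact CPD objective; the function names, the `temperature` value, and the toy embeddings are all assumptions introduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical helper, not the talk's code).

    video_emb[i] and text_emb[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative pair.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text: each video should pick out its own text.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video: each text should pick out its own video.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy embeddings (made up for illustration).
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = pair_discrimination_loss(videos, texts_matched)
loss_swapped = pair_discrimination_loss(videos, texts_swapped)
```

In this toy setting the loss is close to zero when each video is paired with its own text and large when the pairs are swapped, which is the behavior a pair discrimination objective is meant to reward.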
Second, we present a Contrastive and Compatible Matching Network (C2M-Net) to directly model the relations between language queries and video moments in a joint embedding space. This new metric-learning framework fully exploits negative samples in two new ways: constructing negative pairs from a dual matching scheme and mining negative pairs across different videos. These new negative samples enhance the joint representation learning of the two modalities via contrastive learning, which maximizes their mutual information. In addition, to precisely rank relatively positive pairs for accurate temporal grounding, we also learn the compatibility between queries and moments by directly regressing their IoU-based similarity. Our C2M-Net yields state-of-the-art performance on three benchmarks: Charades-STA, TACoS, and ActivityNet-Captions.
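The IoU-based regression target mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: moments are `(start, end)` pairs in seconds, the regression loss is a plain mean squared error, and the helper names are hypothetical. The actual C2M-Net loss and any IoU scaling it applies may differ.

```python
def temporal_iou(moment, ground_truth):
    """Intersection over union of two time intervals given as (start, end)."""
    s1, e1 = moment
    s2, e2 = ground_truth
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


def iou_regression_loss(predicted_scores, moments, ground_truth):
    """Mean squared error between predicted compatibility scores and IoU targets.

    Hypothetical helper: the choice of squared error is an assumption made
    for this sketch, not a detail taken from the talk.
    """
    targets = [temporal_iou(m, ground_truth) for m in moments]
    errors = [(p - t) ** 2 for p, t in zip(predicted_scores, targets)]
    return sum(errors) / len(errors)


# Toy example: one query whose ground-truth moment spans 2s to 8s.
ground_truth = (2.0, 8.0)
candidates = [(2.0, 8.0), (0.0, 10.0), (10.0, 12.0)]
targets = [temporal_iou(c, ground_truth) for c in candidates]
perfect_loss = iou_regression_loss(targets, candidates, ground_truth)
```

Because the target is a graded value rather than a binary label, candidates that only partially overlap the ground truth (such as the second one here) receive an intermediate score, which is what allows relatively positive pairs to be ranked against each other.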

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Cross-modal Pretraining and Matching for Video Understanding

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Superpixel Semantics Representation and Pre-training for Vision-Language Task

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Stitching Segments and Sentences Towards Generalization in Video-Text Pre-training

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Text-Video Retrieval with Global-Local Semantic Consistent Learning

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Understanding Chinese Video and Language Via Contrastive Multimodal Pre-Training

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

Multi-dataset Pretraining: A Unified Model for Semantic Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model