Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization.

Chenhao Cui,Xinnian Liang,Shuangzhi Wu,Zhoujun Li
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191104
2023-01-01
Abstract:Natural language video localization (NLVL) aims to locate the matching span relevant to a given query sentence from an untrimmed video. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders, cross-modal interactions are not fine-grained enough, and the semantics are not fully aligned. In this paper, we address the vision-language alignment via joint modeling and contrastive learning. We propose a unified VideoLanguage Representation Network (UniNet), employing a transformer encoder to learn vision-language representations aligned. Simultaneously taking video and text as input, the encoder jointly learns the representations of both and captures the inter-relations between video and text. Then the representations are used by the predictor to locate the grounding video span. Besides, we train our model with contrastive learning to enhance vision-language representations in the training stage. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and adopting unified representation and contrastive learning can improve vision-language semantic alignment.
What problem does this paper attempt to address?