Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training Towards Retrieval.

Mingchao Li, Xiaoming Shi, Haitao Leng, Wei Zhou, Hai-Tao Zheng, Kuncai Zhang
DOI: https://doi.org/10.1609/aaai.v37i1.25222
2023-01-01
Proceedings of the AAAI Conference on Artificial Intelligence
Abstract: Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-training methods suffer from semantic misalignments. The reason is that these methods ignore sequence alignments but focus on critical token alignment. To alleviate the problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), to learn semantic alignments at the sequence level. Specifically, the global modality reconstruction and the cross-modal self-contrasting method are utilized to learn the alignments at the sequence level better. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.