Multiple cross-attention for video-subtitle moment retrieval

Hao Fu, Hongxing Wang
DOI: https://doi.org/10.1016/j.patrec.2022.02.016
IF: 4.757
2022-04-01
Pattern Recognition Letters
Abstract: Given a natural language query, video-subtitle moment retrieval (VSMR) aims to localize a short moment in a video with subtitles. Unlike the extensively studied video moment retrieval task, which locates visual moments that match the text query, VSMR is more challenging because the retrieved results must contain both visual and subtitle content, which requires a deep understanding of an additional subtitle modality alongside the query text and the video itself. To this end, we design a mutually guided cross-attention block that unites multiple self-attention units and guided-attention units with successive mutual connections, and on this basis propose a novel Multiple Cross-Attention (MCA) network for multi-modal interaction and matching. Through this attention interaction among multiple modalities, the proposed MCA models both query-video and query-subtitle relations at the word-by-clip level for VSMR. We quantitatively and qualitatively evaluate MCA on TVR, which is the most challenging VSMR dataset available. Empirical evidence demonstrates that our method outperforms state-of-the-art methods.
computer science, artificial intelligence
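The abstract describes word-by-clip interaction built from self-attention and guided-attention units. The following is a minimal sketch of those two unit types only, not the authors' MCA block: it assumes plain scaled dot-product attention without learned projections, and the feature sizes, random inputs, and function names (`self_attention`, `guided_attention`) are hypothetical choices for illustration. The successive mutual connections the paper refers to are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention; returns attended features and weights.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def self_attention(x):
    # Each element of a sequence attends to the same sequence.
    out, _ = attention(x, x, x)
    return out

def guided_attention(x, guide):
    # Each element of x attends to the elements of another modality.
    return attention(x, guide, guide)

# Hypothetical inputs: 5 query words, 8 clips, 16-dimensional features.
rng = np.random.default_rng(0)
query_words = rng.normal(size=(5, 16))
video_clips = rng.normal(size=(8, 16))
subtitle_clips = rng.normal(size=(8, 16))

q = self_attention(query_words)
# Word-by-clip weights: one row per query word, one column per clip.
q_video, w_video = guided_attention(q, self_attention(video_clips))
q_sub, w_sub = guided_attention(q, self_attention(subtitle_clips))
```

Each row of `w_video` and `w_sub` is a distribution over clips for one query word, which is one way to read "word-by-clip level" relations between the query and each of the other two modalities.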