Attention-Based Relation Reasoning Network for Video-Text Retrieval.

N Wang, Z Wang, X Xu, F Shen, Y Yang, HT Shen
DOI: https://doi.org/10.1109/ICME51207.2021.9428215
2021-01-01
Abstract: In video-text matching, a single modality contains several useful internal relations that existing approaches typically ignore. In this paper, we propose a novel model named Attention-based Relation Reasoning Network (ARRN), which robustly learns and reasons about the relations between the words of a sentence and the temporal relations between video frames. It jointly captures the local and global characteristics of video and text, and thus significantly improves video-text retrieval performance. In ARRN, a global-to-local attention strategy attends to important relations at multiple scales and thereby learns more reasonable local relation features. These features, generated at distinct levels, are powerful and complementary to each other, allowing us to obtain effective video and text representations through very simple fusion. Extensive experiments on two widely used video-text datasets, MSVD and TGIF, show that the proposed ARRN achieves a substantial improvement.
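The abstract describes the pipeline only at a high level, so the following is a minimal NumPy sketch of one way the idea could look, not the authors' architecture. It assumes, purely for illustration, that a relation feature is the mean of k consecutive frame or word features, that the global feature is the mean over all items, that attention is a scaled dot product against the global feature, and that fusion is a plain average. All function names (`window_relations`, `global_to_local_attention`, `encode`, `cosine`) and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def window_relations(feats, k):
    # Assumption: a scale-k relation feature is the mean of k consecutive items.
    n = feats.shape[0]
    return np.stack([feats[i:i + k].mean(axis=0) for i in range(n - k + 1)])

def global_to_local_attention(global_feat, relations):
    # Weight each local relation by its scaled dot-product similarity
    # to the global feature, then take the weighted sum.
    d = global_feat.shape[0]
    weights = softmax(relations @ global_feat / np.sqrt(d))
    return weights @ relations, weights

def encode(feats, scales=(2, 3)):
    # Global feature plus one attended relation feature per scale,
    # fused by a simple average (assumed form of "very simple fusion").
    global_feat = feats.mean(axis=0)
    parts = [global_feat]
    for k in scales:
        attended, _ = global_to_local_attention(
            global_feat, window_relations(feats, k))
        parts.append(attended)
    return np.mean(parts, axis=0)

def cosine(a, b):
    # Cosine similarity used here as the retrieval score.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder inputs: 8 frame features and 5 word features, both 16-dimensional.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
text = rng.normal(size=(5, 16))
score = cosine(encode(video), encode(text))
```

In a retrieval setting, `score` would be computed for every video-text pair and the candidates ranked by it. A trained model would replace the mean-pooled relations and the fixed dot-product attention with learned components.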