Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment.

Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun
DOI: https://doi.org/10.24963/ijcai.2021/154
2021-01-01
Abstract: Multi-modal cues present in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods take advantage of multi-modal cues by aggregating them into holistic high-level semantics that are matched against text representations in a global view. In contrast to this global alignment, the local alignment of detailed semantics encoded within both multi-modal cues and distinct phrases has not yet been well explored. Thus, in this paper, we leverage hierarchical video-text alignment to fully explore the detailed, diverse characteristics of multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
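The idea of scoring a video-text pair at two levels can be illustrated with a minimal sketch. This is not the authors' model: a single attention step stands in for the paper's multi-step attention, mean pooling stands in for the holistic transformer, and all function names, the toy feature vectors, and the `local_weight` mixing parameter are assumptions made purely for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0 or nb == 0:
        return 0.0
    return dot(a, b) / (na * nb)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_alignment(phrases, cues):
    # Assumption: one attention step per phrase over the cue features
    # (the paper describes multi-step attention; this is a simplification).
    scores = []
    for p in phrases:
        weights = softmax([dot(p, c) for c in cues])
        attended = [sum(w * c[i] for w, c in zip(weights, cues))
                    for i in range(len(p))]
        scores.append(cosine(p, attended))
    return sum(scores) / len(scores)

def global_alignment(phrases, cues):
    # Assumption: mean pooling replaces the holistic transformer summary.
    return cosine(mean_pool(phrases), mean_pool(cues))

def hierarchical_similarity(phrases, cues, local_weight=0.5):
    # Hypothetical mixing of the two levels; the weighting is illustrative.
    local = local_alignment(phrases, cues)
    glob = global_alignment(phrases, cues)
    return local_weight * local + (1.0 - local_weight) * glob

# Toy 2-D features: two phrase embeddings and two multi-modal cue embeddings.
phrases = [[1.0, 0.0], [0.0, 1.0]]
matching_cues = [[1.0, 0.0], [0.0, 1.0]]
mismatched_cues = [[-1.0, 0.0], [0.0, -1.0]]

matched_score = hierarchical_similarity(phrases, matching_cues)
mismatched_score = hierarchical_similarity(phrases, mismatched_cues)
```

In this toy setting the matching pair scores higher than the mismatched one; in a retrieval system, candidate videos would be ranked by such a combined score for each text query.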