Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval.

Dehao Wu, Yi Li, Yinghong Zhang, Yuesheng Zhu
DOI: https://doi.org/10.1109/ICME51207.2021.9428153
2021-01-01
Abstract: The video-text retrieval task has attracted increasing attention due to the rapid growth of videos on the Internet. Existing works adopt various networks to encode videos and texts into a common latent space and calculate their similarities. However, most works neither mine the significant frames of a video nor account for differences among the dimensions of word representations, leading to unsatisfactory retrieval results. In this paper, we propose a Multi-Dimensional Attentive Hierarchical Graph Pooling Network (MAGP) to learn improved representations for video-text retrieval. Specifically, we design a novel hierarchical graph pooling method that extracts significant frames from videos and discards unrelated ones, so that the model can learn hierarchical and discriminative video representations. Moreover, a multi-dimensional attention mechanism is used in the text encoder to strengthen its representation ability through dimension-level attention. Experimental results on three video-text datasets demonstrate that our MAGP model outperforms state-of-the-art models.
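The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the MAGP architecture, which the abstract does not specify; it assumes a generic score-based top-k pooling over frame nodes and a generic dimension-wise attention in which each feature dimension gets its own softmax over the words. The function names (`topk_frame_pool`, `multi_dim_attention`), the `ratio` parameter, and the idea that scores and logits are supplied by some learned scorer are all hypothetical.

```python
import math


def topk_frame_pool(frames, scores, ratio=0.5):
    """Keep the highest-scoring frames and drop the rest.

    frames: list of feature vectors, one per frame (graph node).
    scores: one relevance score per frame (assumed to come from a learned scorer).
    Returns the kept frames in their original temporal order, plus their indices.
    """
    k = max(1, math.ceil(ratio * len(frames)))
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore temporal order
    return [frames[i] for i in keep], keep


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_dim_attention(words, logits):
    """Pool word vectors with a separate attention distribution per dimension.

    words:  n_words x d matrix of word features.
    logits: n_words x d matrix of attention logits (assumed to be learned).
    Ordinary attention gives one weight per word; here every feature
    dimension j has its own softmax over the words.
    """
    n, d = len(words), len(words[0])
    pooled = []
    for j in range(d):
        weights = softmax([logits[i][j] for i in range(n)])
        pooled.append(sum(weights[i] * words[i][j] for i in range(n)))
    return pooled


# Usage: pool four frames down to two, then pool two word vectors into one.
frames = [[1.0], [2.0], [3.0], [4.0]]
kept, idx = topk_frame_pool(frames, scores=[0.1, 0.9, 0.3, 0.8], ratio=0.5)
sentence = multi_dim_attention([[1.0, 10.0], [3.0, 20.0]], [[0.0, 0.0], [0.0, 0.0]])
```

Applying `topk_frame_pool` repeatedly, with fresh scores at each level, would give a hierarchy of progressively smaller frame sets, which is one plausible reading of "hierarchical" pooling here.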