Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu
DOI: https://doi.org/10.1109/icme51207.2021.9428369
2021-07-05
Abstract: The dominant approaches to video moment retrieval learn the semantic correlation between a given query and the video. However, these methods rarely explore the fine-grained semantic structure and comprehensive visual structure, leaving textual and visual relations underutilized. In this paper, we propose a unified framework for video moment retrieval that simultaneously encodes semantic and visual structures. Specifically, a semantic role tree is built to reveal fine-grained semantic information by generating hierarchical textual embeddings. The semantic structure is then used to facilitate visual structure learning through a contextual attention-based proposal interaction module. Finally, we adaptively aggregate the visual-semantic matching information through a multi-level fusion strategy to select the best-matching moment proposal. Extensive experiments on two popular benchmarks (Charades-STA and ActivityNet Captions) show that the proposed method achieves state-of-the-art performance. Code is available in the Supplementary Material.