Fine-Grained Modality Relation-Aware Network for Video Moment Retrieval

Yibo Zhao, Zan Gao, Chunjie Ma, Weili Guan, Riwei Wang, Shengyong Chen
DOI: https://doi.org/10.1109/tcsvt.2024.3494744
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video moment retrieval (VMR) involves localizing the video segments that are semantically aligned with a given query. Despite the development of numerous VMR methods in recent years, there remains a need to better incorporate fine-grained modality relation-aware information, both within and across modalities. To address this need, we propose a Fine-grained Modality Relation-Aware Network (FMRN) tailored for the video moment retrieval task. FMRN explores fine-grained modality relation-aware information within text queries, videos, and proposals. Our approach begins with a semantic graph encoder that captures deep intra-modality semantic relations. We then introduce a novel fine-grained cross-modality interaction module comprising a cross-similarity weighting module, an intra-modality weighting module, and an adaptive fusion module. These components comprehensively exploit fine-grained relation information in both intra-modality and cross-modality contexts. Specifically, the cross-similarity weighting module leverages similarities between text queries and video snippets, as well as between videos and query words. The intra-modality weighting module determines the importance of words and snippets, while the adaptive fusion module combines the cross-similarity and intra-modality weights. Additionally, we design a proposal relation module to enhance retrieval by capturing fine-grained relations among proposals in videos. Extensive experiments demonstrate that the proposed method can outperform all state-of-the-art methods on the TACoS dataset and obtain comparable results on the Charades-STA and ActivityNet-Captions datasets. Compared with MCMN (TCSVT2024) and DPHANet (TMM2024), FMRN achieves average improvements of 3.61% and 5.44%, respectively, on the TACoS dataset.
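The abstract names the three parts of the cross-modality interaction module but does not give their equations. The following is only an illustrative sketch of how cross-similarity weighting, intra-modality weighting, and adaptive fusion could fit together. It assumes cosine similarity, max-pooling over the similarity matrix, softmax normalization, a scoring vector `scorer`, and a scalar `gate`; all of these names and choices are hypothetical and are not taken from the FMRN paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def cross_similarity_weights(words, snippets):
    """Weight each word by its best match to any snippet, and vice versa.

    Assumption: max-pooling plus softmax stands in for whatever
    cross-similarity weighting the paper actually uses.
    """
    sim = [[cosine(w, s) for s in snippets] for w in words]
    word_w = softmax([max(row) for row in sim])
    snip_w = softmax([max(sim[i][j] for i in range(len(words)))
                      for j in range(len(snippets))])
    return word_w, snip_w

def intra_modality_weights(items, scorer):
    """Importance of each item within its own modality.

    Assumption: `scorer` plays the role of a learned projection vector.
    """
    return softmax([dot(x, scorer) for x in items])

def adaptive_fusion(cross_w, intra_w, gate):
    """Convex combination of the two weightings; `gate` lies in [0, 1].

    Assumption: a fixed scalar gate replaces a learned, input-dependent one.
    """
    return [gate * c + (1.0 - gate) * i for c, i in zip(cross_w, intra_w)]
```

With toy two-dimensional features for two words and three snippets, the fused word weights remain a valid distribution because both inputs sum to one and the fusion is a convex combination. In a trained model the gate and the scoring vector would be learned rather than fixed.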