Self-Supervised Graph Convolution for Video Moment Retrieval

Xiwen Hu, Guolong Wang, Shimin Shan, Yu Liu, Jiangquan Li
DOI: https://doi.org/10.1007/978-3-031-44204-9_34
2023-01-01
Abstract: Video Moment Retrieval is the task of locating a moment in an untrimmed video that is relevant to a given query. It is a highly challenging multi-modal task due to biased annotations and complex cross-modal interaction. In this paper, we propose the Self-Supervised Graph Convolution Network (SSGCN) for video moment retrieval. To address biased annotations, we design a self-supervised auxiliary task that mines feature representations of inherent video and text information by randomly dropping out moment-text relations. To handle complex cross-modal interaction, we use two Graph Convolutional Networks to obtain feature representations of the video and text modalities. The representations of the two modalities are then passed through cross-attention layers to acquire cross-modal information, which is treated as implicit graph matching edges to update the graph neural network. The effectiveness of the proposed model is validated through extensive experiments.
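The pipeline outlined in the abstract (one graph convolution per modality, cross-attention between the two node sets, and random dropout of moment-text relations) can be illustrated with a minimal pure-Python sketch. This is not the authors' SSGCN implementation: the abstract does not specify the graph construction, normalization, layer sizes, or training objective, so the chain-shaped video graph, the row-normalized adjacency, the identity weight matrix, the residual update, and every function name below are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and row-normalize (an assumed, common GCN choice)."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    hidden = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Scaled dot-product attention weights from each query node to every key node."""
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    return [softmax([s / scale for s in row]) for row in matmul(queries, keys_t)]


def drop_relations(pairs, drop_prob, rng):
    """Randomly drop (moment, text) relation pairs with probability drop_prob."""
    return [p for p in pairs if rng.random() >= drop_prob]


# Toy inputs (hypothetical): 3 video-clip nodes in a chain, 2 fully connected word nodes.
video_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
text_adj = [[0, 1], [1, 0]]
video_x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
text_x = [[1.0, 0.2], [0.1, 0.9]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the example easy to follow

# Per-modality graph convolution.
video_h = gcn_layer(video_adj, video_x, weight)
text_h = gcn_layer(text_adj, text_x, weight)

# Attention weights act as soft video-to-text matching edges; each video node
# is then updated with the text features it attends to (assumed residual update).
attn = cross_attention(video_h, text_h)
messages = matmul(attn, text_h)
video_updated = [[h + m for h, m in zip(h_row, m_row)]
                 for h_row, m_row in zip(video_h, messages)]

# Self-supervised ingredient: hide a random subset of moment-text relations.
relations = [(0, 0), (1, 0), (2, 1)]
kept = drop_relations(relations, 0.5, random.Random(0))
```

Each row of `attn` is a probability distribution over text nodes, which is one way to read "implicit graph matching edges": the weights are computed from the features rather than taken from a fixed adjacency. A practical model would use learned projections and a deep learning framework rather than list arithmetic.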