Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval

Zhanghao Huang, Yi Ji, Ying Li, Chunping Liu
DOI: https://doi.org/10.1109/lsp.2024.3403533
2024-06-12
IEEE Signal Processing Letters
Abstract: Video Moment Retrieval (VMR) is a challenging task that aims to locate segments in untrimmed videos by semantically matching them against given queries. Because most existing methods neglect the valuable clues carried by edge information, they struggle to pinpoint the target segment precisely when the target moment is complex. To this end, this paper proposes a novel perception network, Gazing After Glancing (GAG), to utilize edge information. Inspired by human reading habits, we propose a glancing-and-gazing localization strategy that divides VMR with the perception network into two stages, "glancing" and "gazing". The glancing stage uses a commonly used coarse-grained feature encoder and an edge-guided span predictor to locate the approximate area. The gazing stage leverages the edge information extracted from the result of "glancing" to recalibrate the query feature. Specifically, we propose an edge-guided highlighting block that recalibrates the encoded query feature according to the visual edge semantic information. The refined query feature and the visual feature are then passed to the edge-guided span predictor. Moreover, we employ distillation to enhance the ability of the coarse-grained feature encoder. Experimental results on two widely used datasets, ActivityNet Captions and TACoS, show that the proposed edge-information-guided two-stage VMR method effectively improves localization accuracy.
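The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product relevance, the difference-based boundary scores, the sigmoid gate used to recalibrate the query, and every function name below are assumptions chosen only to show the control flow (a coarse span first, then a query refined from the clips at that span's edges, then a second prediction). The distillation step is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_span(start_scores, end_scores):
    """Return (s, e) with s <= e maximizing start_scores[s] + end_scores[e]."""
    best, best_val, best_s = (0, 0), -math.inf, 0
    for e in range(len(end_scores)):
        # Track the best start index seen so far (indices 0..e).
        if start_scores[e] > start_scores[best_s]:
            best_s = e
        val = start_scores[best_s] + end_scores[e]
        if val > best_val:
            best_val, best = val, (best_s, e)
    return best

def predict_span(clips, query):
    """Hypothetical stand-in for a span predictor: boundaries are scored
    by how sharply query relevance rises (start) or falls (end)."""
    rel = [dot(c, query) for c in clips]
    padded = [0.0] + rel + [0.0]
    start = [padded[i + 1] - padded[i] for i in range(len(rel))]
    end = [padded[i + 1] - padded[i + 2] for i in range(len(rel))]
    return best_span(start, end)

def highlight_query(query, edge_clips):
    """Hypothetical recalibration: gate each query dimension with a
    sigmoid of the mean edge-clip feature (an assumed form)."""
    dim = len(query)
    mean = [sum(c[d] for c in edge_clips) / len(edge_clips) for d in range(dim)]
    return [q / (1.0 + math.exp(-m)) for q, m in zip(query, mean)]

def glance_then_gaze(clips, query):
    coarse = predict_span(clips, query)              # glancing stage
    edges = [clips[coarse[0]], clips[coarse[1]]]     # clips at the coarse boundaries
    refined_query = highlight_query(query, edges)    # gazing: recalibrate the query
    refined = predict_span(clips, refined_query)     # gazing: predict again
    return coarse, refined
```

In this sketch the clip and query features are plain lists of floats; in a real system they would come from learned encoders, and the second prediction would generally differ from the first.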
engineering, electrical & electronic