STDNet: Spatio-Temporal Decomposed Network for Video Grounding

Yuanwu Xu, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao
DOI: https://doi.org/10.1109/icme52920.2022.9859855
2022-01-01
Abstract: Previous methods for video grounding treated either the query or the video as a whole, while neglecting their respective semantics in the orthogonal space and time dimensions. Since spatial semantics appears frequently in a video, temporal semantics is more discriminative and deserves more attention. Based on such considerations, we propose a novel Spatio-Temporal Decomposed Network (STDNet) which decomposes the query and the video into their spatial and temporal semantics, respectively. Specifically, spatial and temporal words are selected from the query, and the video is split into two pathways. Spatial cross-modal attention is computed first and serves as prior knowledge for temporal attention. A new localization strategy is also devised which regresses the segment's start conditioned on the end and essentially breaks the independence assumption made in previous methods. Experimental results on three public benchmark datasets show that our STDNet outperforms the state-of-the-art methods.
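The abstract does not spell out the localization formulation. One plausible reading of "regresses the segment's start conditioned on the end" is the factorization P(s, e) = P(e) · P(s | e), in contrast to the independent form P(s) · P(e). The toy sketch below only illustrates that contrast; the function names, the discrete clip indices, and all probability values are assumptions made up for illustration and are not taken from the paper or its implementation.

```python
# Toy illustration (NOT the authors' method): independent vs. end-conditioned
# selection of a segment (start, end) over 4 video clips.
# All names and numbers below are hypothetical.

def best_segment_independent(p_start, p_end):
    """Pick (s, e) with s <= e maximizing p_start[s] * p_end[e]."""
    best, best_score = None, -1.0
    for e, pe in enumerate(p_end):
        for s in range(e + 1):
            score = p_start[s] * pe
            if score > best_score:
                best, best_score = (s, e), score
    return best

def best_segment_conditioned(p_end, p_start_given_end):
    """Pick (s, e) with s <= e maximizing p_end[e] * p_start_given_end[e][s]."""
    best, best_score = None, -1.0
    for e, pe in enumerate(p_end):
        for s in range(e + 1):
            score = pe * p_start_given_end[e][s]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up scores: two candidate events, clips (0, 1) and clips (2, 3).
p_start = [0.5, 0.05, 0.4, 0.05]   # marginal start scores
p_end = [0.05, 0.35, 0.05, 0.55]   # marginal end scores
# Row e holds a distribution over starts s <= e, given that the segment ends at e.
p_start_given_end = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.0],
    [0.05, 0.05, 0.8, 0.1],
]

independent = best_segment_independent(p_start, p_end)
conditioned = best_segment_conditioned(p_end, p_start_given_end)
```

With these invented numbers, the independent rule pairs the strongest start with the strongest end and returns (0, 3), a segment spanning both events, whereas the conditioned rule returns (2, 3) because the start is chosen given the end. This is only meant to show why dropping the independence assumption can matter, not how STDNet computes its scores.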