Conditional Video-Text Reconstruction Network with Cauchy Mask for Weakly Supervised Temporal Sentence Grounding

Jueqi Wei, Yuanwu Xu, Mohan Chen, Yuejie Zhang, Rui Feng, Shang Gao
DOI: https://doi.org/10.1109/ICME55011.2023.00261
2023-01-01
Abstract: Temporal sentence grounding aims to detect the target segment most related to a given query in an untrimmed video. To alleviate the expensive annotation cost of temporal labels, researchers have paid more attention to the weakly supervised setting. Prior studies neglected video representation reconstruction, which led to unbalanced alignment learning. Moreover, they used different strategies to generate proposals, which ignored the temporal structure in a query. In this paper, we propose a novel Conditional Video-Text Reconstruction Network (CVTRN). It supports conditional reconstruction of both video and text representations. Specifically, video and text features are fused to compute semantic alignment, which serves as the condition for reconstruction. A new mask strategy for mask-conditioned sentence reconstruction is also devised. This strategy focuses more on boundary regions than the Gaussian mask widely used in previous methods. Experimental results on two public benchmark datasets show that our CVTRN outperforms the state-of-the-art methods.
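The contrast between a Gaussian mask and a Cauchy-shaped mask can be illustrated with a minimal sketch. The abstract does not give the mask's parameterization, so everything below is an assumption made for illustration: the function names, the normalized clip positions in [0, 1], and the use of a single `width` value as both the Gaussian standard deviation and the Cauchy scale. It is not the authors' implementation. The sketch only shows the general property that a Cauchy curve decays more slowly away from its center, so clips far from the center of a proposal keep more weight than under a Gaussian.

```python
import math


def clip_positions(num_clips):
    """Normalized positions of clips in [0, 1] (hypothetical convention)."""
    if num_clips < 2:
        raise ValueError("num_clips must be at least 2")
    return [i / (num_clips - 1) for i in range(num_clips)]


def gaussian_mask(num_clips, center, width):
    """Gaussian weights; `width` is assumed to be the standard deviation."""
    return [
        math.exp(-((t - center) ** 2) / (2.0 * width ** 2))
        for t in clip_positions(num_clips)
    ]


def cauchy_mask(num_clips, center, width):
    """Cauchy (Lorentzian) weights; `width` is assumed to be the scale."""
    return [
        1.0 / (1.0 + ((t - center) / width) ** 2)
        for t in clip_positions(num_clips)
    ]


if __name__ == "__main__":
    g = gaussian_mask(11, center=0.5, width=0.1)
    c = cauchy_mask(11, center=0.5, width=0.1)
    for t, gv, cv in zip(clip_positions(11), g, c):
        print(f"t={t:.1f}  gaussian={gv:.4f}  cauchy={cv:.4f}")
```

Both masks peak at 1 at the center. Under these assumptions, a clip three widths away from the center receives a weight of about 0.1 from the Cauchy mask and about 0.01 from the Gaussian mask, which is one way a mask could place relatively more emphasis away from the center of a proposal.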