Few-Shot Temporal Sentence Grounding Via Memory-Guided Semantic Learning

Daizong Liu, Pan Zhou, Zichuan Xu, Haozhao Wang, Ruixuan Li
DOI: https://doi.org/10.1109/tcsvt.2022.3223725
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Temporal sentence grounding (TSG) is an important yet challenging task in video-based information retrieval. Given an untrimmed video, the machine must predict the video segment that is semantically related to a given sentence query. Most existing TSG methods train well-designed deep networks on large amounts of data to align the semantics of video-query pairs for activity grounding. However, we argue that these works easily capture the selection biases of video-query pairs in a dataset rather than showing the robust reasoning ability needed to handle rarely appearing pairs (i.e., few-shot contents). To alleviate this limitation of the imbalanced data distribution during network training, we propose a novel memory-augmented network called Memory-Guided Semantic Learning Network (MGSL-Net) that handles the few-shot TSG task and enhances the model's generalization ability. Specifically, given a matched video-query input, we first employ a graph attentive cross-modal interaction module to align their semantics in a cycle-consistent manner. Then, we develop memory modules in both the video and query domains to record the cross-modal shared semantic features in domain-specific persistent memory. Finally, a heterogeneous attention module integrates the memory-enhanced multi-modal features in both the video and query domains with further feature calibration. During training, the memory modules are dynamically associated with both common and rare cases to memorize all contents that appear, alleviating the issue of forgetting the few-shot contents. Therefore, in testing, the rare cases can be enhanced by retrieving the stored memories, improving the generalization ability of the model. Experimental results on three benchmarks (ActivityNet Caption, Charades-STA and TACoS) show the superiority of our method in both effectiveness and efficiency.
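The memory idea in the abstract (store features during training, retrieve them at test time to enhance rare cases) can be illustrated with a generic sketch. This is not the MGSL-Net implementation: the abstract does not specify the read or write rules, so the soft-attention read, the moving-average write, and every name below (`memory_read`, `memory_write`, `rate`) are assumptions chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_read(feature, memory):
    """Assumed read rule: attend over memory slots and append the retrieved vector.

    Returns (attention weights, feature concatenated with retrieved memory).
    """
    scale = math.sqrt(len(feature))
    weights = softmax([dot(feature, slot) / scale for slot in memory])
    retrieved = [sum(w * slot[i] for w, slot in zip(weights, memory))
                 for i in range(len(feature))]
    return weights, feature + retrieved

def memory_write(feature, memory, rate=0.1):
    """Assumed write rule: nudge the best-matching slot toward the feature.

    Returns a new memory; the input memory is left unchanged.
    """
    best = max(range(len(memory)), key=lambda k: dot(feature, memory[k]))
    updated = [list(slot) for slot in memory]
    updated[best] = [(1 - rate) * m + rate * f
                     for m, f in zip(memory[best], feature)]
    return updated

# Toy data (hypothetical): two memory slots and one "rare" input feature.
memory = [[1.0, 0.0], [0.0, 1.0]]
rare = [0.1, 0.9]
weights, enhanced = memory_read(rare, memory)
new_memory = memory_write(rare, memory)
```

In this toy setup the rare feature attends mostly to the second slot, and the enhanced vector carries both the original feature and the retrieved memory content, which is the general mechanism by which stored memories could compensate for under-represented inputs.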