Counterfactually Augmented Event Matching for De-biased Temporal Sentence Grounding

Xun Jiang, Zhuoyuan Wei, Shenshen Li, Xing Xu, Jingkuan Song, Heng Tao Shen
DOI: https://doi.org/10.1145/3664647.3680948
2024-01-01
Abstract: Temporal Sentence Grounding (TSG), which aims to localize events in untrimmed videos given a language query, has been widely studied over the past decades. However, researchers have recently demonstrated that previous approaches generalize poorly to out-of-distribution data, and have therefore proposed the De-biased TSG (DTSG) challenge, which requires models to remain robust on outlier test samples. In this paper, we design a novel framework, termed Counterfactually-Augmented Event Matching (CAEM), which incorporates counterfactual data augmentation to learn event-query joint representations that resist training bias. Specifically, it consists of three components: (1) a Temporal Counterfactual Augmentation module that generates counterfactual video-text pairs by temporally delaying events in the untrimmed video, enhancing the model's capacity for counterfactual thinking; (2) an Event-Query Matching model that learns joint representations and predicts a matching score for each event candidate; and (3) a Counterfact-Adaptive Framework (CAF) that imposes counterfactual consistency rules on the matching process of the same event-query pairs, further mitigating the bias learned from the training sets. We conduct thorough experiments on two widely used DTSG datasets, i.e., Charades-CD and ActivityNet-CD, to evaluate the proposed CAEM method. Extensive experimental results show that CAEM outperforms recent state-of-the-art methods on both datasets. Our implementation code is available at https://github.com/CFM-MSG/CAEM_Code.
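To make the idea of "temporally delaying events" concrete, the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a video is a list of per-clip features and an event is a half-open index span; the function name `delay_event` and the choice to move the clips that follow the event in front of it are hypothetical and do not come from the paper or its code release.

```python
def delay_event(clips, start, end, delay):
    """Shift the event clips[start:end] later by `delay` positions.

    Hypothetical illustration: the `delay` clips that originally followed
    the event are moved in front of it, so the video keeps its length and
    content while the event occupies a later temporal span.
    Returns the reordered clip list and the new (start, end) span.
    """
    if not (0 <= start < end <= len(clips)):
        raise ValueError("event span must lie inside the video")
    if delay < 0 or end + delay > len(clips):
        raise ValueError("not enough clips after the event to delay it")

    before = clips[:start]           # clips preceding the event
    event = clips[start:end]         # the grounded event itself
    moved = clips[end:end + delay]   # clips pulled in front of the event
    after = clips[end + delay:]      # remaining tail of the video

    shifted = before + moved + event + after
    return shifted, (start + delay, end + delay)


# Usage: a 7-clip video whose event covers clips 2..3 is delayed by 2 clips.
video = ["b0", "b1", "e0", "e1", "b2", "b3", "b4"]
shifted, span = delay_event(video, start=2, end=4, delay=2)
```

Under this assumption the query text is unchanged and only the ground-truth span moves, which yields a video-text pair whose event position differs from the one seen in the original sample.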