Curriculum Multi-Negative Augmentation for Debiased Video Grounding.

Xiaohan Lan, Yitian Yuan, Hong Chen, Xin Wang, Zequn Jie, Lin Ma, Zhi Wang, Wenwu Zhu
DOI: https://doi.org/10.1609/aaai.v37i1.25204
2023-01-01
Proceedings of the AAAI Conference on Artificial Intelligence
Abstract: Video Grounding (VG) aims to locate the desired segment in a video given a sentence query. Recent studies have found that current VG models tend to over-rely on biases in the distribution of ground-truth moment annotations in the training set. To discourage the standard VG model from exploiting such temporal annotation biases and to improve its generalization ability, we propose multiple negative augmentations organized hierarchically, including cross-video augmentations at the clip and video levels, and self-shuffled augmentations with masks. These augmentations can effectively diversify the data distribution so that the model makes more reasonable predictions instead of merely fitting the temporal biases. However, directly adopting such a data augmentation strategy may inevitably introduce some noise, as shown in our cases, since not all of the handcrafted augmentations are semantically irrelevant to the ground-truth video. To further denoise and improve the grounding accuracy, we design a multi-stage curriculum strategy to adaptively train the standard VG model from easy to hard negative augmentations. Experiments on the newly collected Charades-CD and ActivityNet-CD datasets demonstrate that our proposed strategy can improve the performance of the base model in both i.i.d. and o.o.d. scenarios.
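The negative augmentations and the easy-to-hard schedule named in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the representation of a video as a list of clip identifiers, the choice to overwrite the annotated span, and the ordering of augmentation kinds from easy to hard are all assumptions made for illustration. The masking component mentioned in the abstract is not modeled.

```python
import random

def cross_video_clip_swap(clips, start, end, other_clips, rng):
    """Hypothetical clip-level cross-video negative: overwrite the
    annotated span [start, end) with clips drawn from another video."""
    out = list(clips)
    for i in range(start, end):
        out[i] = rng.choice(other_clips)
    return out

def self_shuffle(clips, rng):
    """Hypothetical self-shuffled negative: same clips, temporal order
    randomized. The input list is left unmodified."""
    shuffled = list(clips)
    rng.shuffle(shuffled)
    return shuffled

def curriculum_stage(epoch, stage_len, kinds):
    """Return the augmentation kinds active at this epoch. One more kind
    is unlocked every `stage_len` epochs; `kinds` is assumed to be
    ordered from easy to hard (an assumption, not from the paper)."""
    n = min(len(kinds), epoch // stage_len + 1)
    return kinds[:n]

rng = random.Random(0)
clips = ["a0", "a1", "a2", "a3", "a4"]   # clips of the query's video
other = ["b0", "b1", "b2"]               # clips of an unrelated video
neg_swap = cross_video_clip_swap(clips, 1, 3, other, rng)
neg_shuffle = self_shuffle(clips, rng)
kinds = ["video_level", "clip_level", "self_shuffle"]  # assumed ordering
active = curriculum_stage(epoch=2, stage_len=2, kinds=kinds)
```

In this sketch a training loop would call curriculum_stage once per epoch and only generate negatives of the active kinds, so harder negatives enter training gradually.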