Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement.

Xin Wang, Zihao Wu, Hong Chen, Xiaohan Lan, Wenwu Zhu
DOI: https://doi.org/10.1145/3581783.3612401
2023-01-01
Abstract: Video Grounding (VG) has drawn widespread attention over the past few years, and numerous studies have been devoted to improving performance on various VG benchmarks. Nevertheless, the label annotation procedures in VG produce imbalanced query-moment-label distributions in the datasets, which severely impair a model's ability to truly understand video content. Existing works on debiased VG focus either on adjusting the learning model or on video-level augmentation, and fail to handle the temporal bias caused by imbalanced query-moment-label distributions. In this paper, we propose a Disentangled Feature Mixup (DFM) framework for debiased VG that tackles the temporal bias issue. Specifically, a feature-mixup augmentation strategy generates new (text, location) pairs with diverse temporal distributions by jointly augmenting the representations of text queries and the location labels. This strategy encourages predictions based on more diverse data samples with balanced query-moment-label distributions. Furthermore, we design a content-location disentanglement module that separates the representations of temporal information and content information in videos, removing the spurious effect of temporal biases on the video representation. Because DFM performs augmentation and disentanglement at the feature level, it is model-agnostic and can be applied simply and effectively to most baselines. Extensive experiments show that DFM significantly outperforms baseline models on various metrics under both independent and identically distributed (i.i.d.) and out-of-distribution (o.o.d.) settings, especially in scenarios where the annotation distribution changes.
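To make the idea of jointly augmenting query representations and location labels concrete, the following is a minimal sketch of generic mixup applied to two (text feature, location) samples. It is not the DFM formulation from the paper, which the abstract does not specify. The function name `mixup_pair`, the Beta-distributed mixing coefficient, the list-based feature vectors, and the normalized (start, end) labels are all assumptions made for illustration.

```python
import random


def mixup_pair(feat_a, loc_a, feat_b, loc_b, alpha=1.0, rng=random):
    """Generic mixup of two (query feature, location) samples.

    Hypothetical helper, not the paper's method.
    feat_*: equal-length lists of floats (assumed text-query embeddings).
    loc_*:  (start, end) tuples, assumed normalized to [0, 1] of video length.
    """
    # Mixing coefficient drawn from Beta(alpha, alpha), as in standard mixup.
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the two feature vectors.
    feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    # The same coefficient mixes the location labels, so start <= end is preserved.
    loc = tuple(lam * a + (1.0 - lam) * b for a, b in zip(loc_a, loc_b))
    return feat, loc, lam


rng = random.Random(0)  # seeded for repeatability
feat, loc, lam = mixup_pair([1.0, 0.0], (0.0, 0.2), [0.0, 1.0], (0.6, 1.0), rng=rng)
```

Using one coefficient for both the features and the labels keeps each new pair consistent: the mixed location is the same convex combination as the mixed query feature, and the resulting start time can never exceed the end time.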