Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement

Weitong Cai, Jiabo Huang, Jian Hu, Shaogang Gong, Hailin Jin, Yang Liu
DOI: https://doi.org/10.1109/icprs62101.2024.10677814
2024-01-01
Abstract: Video moment retrieval (VMR) aims to locate temporal activities in untrimmed videos from sentence queries, but it suffers from a temporal bias problem: VMR models tend to over-rely on statistical regularities rather than cross-modal semantics, and so perform poorly under different distributions. Existing approaches clip or reorder video segments, or overlook some samples, which breaks sample integrity and wastes information; this is inappropriate when VMR datasets are limited and labor-intensive to label. In this work, without sacrificing samples’ inherent value to balance performance, we develop a novel Temporal feature Perturbation and Refinement (TPR) method that augments each sample. Specifically, we perturb frame features by manipulating their time-level statistics, to diversify temporal distributions and promote more generalizable cross-modal learning. Because perturbation can plausibly shift moment boundaries, we further refine the final predictions by augmenting time-point labels into candidate endpoint sets with designed query triplets. Experiments show TPR’s superiority across various temporal distributions.
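The abstract does not specify how the time-level statistics are manipulated, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes the statistics are the per-channel mean and standard deviation taken along the time axis, and that they are perturbed with Gaussian noise; the function name `perturb_temporal_stats` and the parameters `noise_scale` and `eps` are hypothetical.

```python
import random
from statistics import fmean, pstdev


def perturb_temporal_stats(features, noise_scale=0.1, rng=None, eps=1e-6):
    """Perturb per-channel temporal statistics of frame features.

    features: list of T frames, each a list of D floats.
    Assumption (not from the paper): "time-level statistics" means the
    mean and standard deviation of each channel over the T frames.
    """
    rng = rng or random.Random()
    num_channels = len(features[0])
    out = [[0.0] * num_channels for _ in features]
    for d in range(num_channels):
        channel = [frame[d] for frame in features]
        mu = fmean(channel)
        sigma = pstdev(channel) + eps  # eps avoids division by zero
        # Draw perturbed statistics around the original ones.
        new_mu = mu + rng.gauss(0.0, noise_scale) * sigma
        new_sigma = sigma * (1.0 + rng.gauss(0.0, noise_scale))
        # Standardize over time, then re-scale with the perturbed statistics.
        for t, x in enumerate(channel):
            out[t][d] = (x - mu) / sigma * new_sigma + new_mu
    return out


# Usage: a toy video with 4 frames and 2 feature channels.
frames = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
augmented = perturb_temporal_stats(frames, noise_scale=0.5, rng=random.Random(0))
```

In this sketch the number of frames and channels is unchanged, so every original sample is kept whole and only the distribution of its features over time is altered. With `noise_scale=0.0` the function returns the input up to floating-point error.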