An Adversarial Video Moment Retrieval Algorithm

Mohan Jia, Zhongjian Dai, Yaping Dai, Zhiyang Jia
DOI: https://doi.org/10.23919/ccc55666.2022.9902146
2022-01-01
Abstract: In one-stage methods for video moment retrieval, the common representations are supervised only indirectly, through boundary prediction, and therefore fail to fully preserve the inherent characteristics of the video and the query, which limits retrieval accuracy. To solve this problem, an Adversarial Video Moment Retrieval (AVMR) algorithm is proposed to learn common representations with modality invariance and cross-modal similarity. AVMR is implemented through adversarial learning between a feature projector and a modality classifier. The feature projector tries to generate modality-invariant common representations that confuse the modality classifier. The modality classifier tries to discriminate between the modalities based on the representations generated by the feature projector. Triplet constraints are further imposed on the feature projector to preserve the underlying cross-modal semantic structure of the data. The experimental results show that AVMR surpasses the baseline Attentive Cross-modal Relevance Matching (ACRM) by 1.10% and 1.73% in the “mIoU” metric on two public datasets, Charades-STA and TACoS, respectively.
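The interplay between the two players described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss formulations, so everything below is an assumption made for illustration: the binary cross-entropy form of the classifier loss, the squared Euclidean distance, the margin, the weight `lam`, and all function names are hypothetical and not taken from the paper.

```python
import math

def modality_loss(probs_video, probs_query):
    """Binary cross-entropy of a modality classifier (assumed form).

    Each value is the classifier's predicted probability that a common
    representation came from the video modality (label 1 = video, 0 = query).
    """
    eps = 1e-12  # guards against log(0)
    loss_v = [-math.log(max(p, eps)) for p in probs_video]
    loss_q = [-math.log(max(1.0 - p, eps)) for p in probs_query]
    return sum(loss_v + loss_q) / (len(loss_v) + len(loss_q))

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the matching pair together and push the
    non-matching pair at least `margin` further away (assumed form)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def projector_objective(trip, mod_loss, lam=0.1):
    """Objective minimised by the feature projector (assumed weighting).

    The projector lowers the triplet loss while RAISING the classifier's
    loss, which is the adversarial part; the classifier itself is trained
    separately to minimise `mod_loss`.
    """
    return trip - lam * mod_loss
```

In this sketch, a classifier that outputs 0.5 for every input is fully confused and its loss equals ln 2, which is the state the projector pushes towards. In practice the two objectives would be optimised alternately or with a gradient reversal layer; which of these the authors use is not stated in the abstract.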