A Hybird Alignment Loss for Temporal Moment Localization with Natural Language

Chao Guo, Daizong Liu, Pan Zhou
DOI: https://doi.org/10.1109/icme52920.2022.9859675
2022-01-01
Abstract: This paper addresses the problem of temporal moment localization with natural language, which aims to localize a target video moment according to a language description. A key challenge in this task is learning an effective alignment between the vision and language features extracted from an untrimmed video and a sentence description. Traditional methods generally use a vanilla attention mechanism to model soft alignment between the extracted video and sentence features; however, they lack a training signal to reduce the gap between the video and sentence domains. In this paper, we propose a novel cross-domain alignment module that deeply aligns the features of the two domains, yielding better representations. We evaluate our model on three public datasets: Charades-STA, ActivityNet Captions, and TACoS. The experiments show that the proposed cross-domain alignment module brings significant improvement and can also be applied to other baseline models.