VAL: Visual-Attention Action Localizer

Xiaomeng Song,Yahong Han
DOI: https://doi.org/10.1007/978-3-030-00767-6_32
2018-01-01
Abstract:In this paper, we focus on solving a new task called TALL (Temporal Activity Localization via Language Query). The goal of it is to use nature language queries to localize actions in longer, untrimmed videos. We propose a new model called VAL (Visual-attention Action Localizer) to address it. Specifically, it employs voxel-wise attention and channel-wise attention on last conv-layer feature maps. These two visual attention are designed corresponding to the characteristics of feature maps. They can enhance the visual representations and boost the cross-modal correlation extraction process. Experimental results on TaCoS and Charades-STA datasets both show the effectiveness of our model.
What problem does this paper attempt to address?