Learning Weighted Video Segments for Temporal Action Localization

Che Sun, Hao Song, Xinxiao Wu, Yunde Jia
DOI: https://doi.org/10.1007/978-3-030-31654-9_16
2019-01-01
Abstract: This paper proposes a novel approach that learns weighted video segments via supervised temporal attention for action localization in untrimmed videos. The learned segment weights represent how informative each video segment is for recognizing actions, and they help infer the boundaries needed to localize actions temporally. We build a Supervised Temporal Attention Network (STAN) to dynamically learn the weights of video segments and to generate descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and to predict their classes, respectively. Extensive experiments are conducted on two public benchmarks, THUMOS2014 and ActivityNet1.3. The results demonstrate that our approach achieves substantially better performance than the state-of-the-art methods, verifying the effectiveness of learning weighted video segments.
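The idea of weighting video segments can be illustrated with a minimal sketch of generic softmax attention pooling. This is not the STAN architecture, whose details the abstract does not give. The function names, the use of softmax normalization, and the assumption that per-segment scores are already available are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Normalise raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_video_representation(segment_features, segment_scores):
    """Pool per-segment feature vectors into one video-level vector.

    segment_features: list of equal-length feature vectors, one per segment.
    segment_scores: one raw informativeness score per segment (assumed to
        come from some learned scoring model, which is not shown here).
    Returns (weights, pooled_vector).
    """
    if not segment_features or len(segment_features) != len(segment_scores):
        raise ValueError("need one score per segment and at least one segment")
    weights = softmax(segment_scores)
    dim = len(segment_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, segment_features))
        for d in range(dim)
    ]
    return weights, pooled


if __name__ == "__main__":
    # Three toy segments with 2-D features; the middle one scores highest.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    scores = [0.1, 2.0, 0.3]
    weights, video_vec = weighted_video_representation(feats, scores)
    print(weights, video_vec)
```

In this sketch, segments with higher scores contribute more to the pooled representation, and the weights themselves could in principle be inspected to see which segments are most likely to contain the action.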