Weakly-Supervised Action Localization by Hierarchical Attention Mechanism with Multi-Scale Fusion Strategies

Yu Wang, Shengjie Zhao
DOI: https://doi.org/10.1109/icme57554.2024.10688175
2024-01-01
Abstract: Weakly-supervised temporal action localization aims to locate action intervals when only video-level supervision is available. Conventional methods mostly rely on an attention framework that generates, for each video snippet, a set of scores indicating the confidence that the snippet belongs to the foreground, the background, or the context. However, such methods do not consider the structural properties of snippet-level features when generating attention scores, even though these properties are critical for capturing contextual information in temporal tasks. To this end, we propose a hierarchical attention generation mechanism with multi-scale fusion strategies to model such structural information. In addition, to resolve the action-context confusion that is particularly hard to handle in weakly-supervised action localization, we introduce metric learning into our framework to keep context features from approaching action features while encouraging them to stay close to background features. Finally, we evaluate our model on the THUMOS14 and ActivityNet1.3 benchmarks, and the results demonstrate that the proposed approach achieves desirable performance.
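The metric-learning idea in the abstract can be illustrated with a small sketch. It is not the authors' formulation: the abstract does not specify the loss, so a standard triplet margin loss is swapped in here as an assumption, and the function names (`snippet_attention`, `weighted_mean`, `context_triplet_loss`), the column order of the attention scores, and the `margin` value are all hypothetical.

```python
import numpy as np

def snippet_attention(logits):
    """Three-way softmax per snippet.

    Assumed column order: foreground, background, context.
    logits has shape (T, 3) for T snippets.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_mean(features, weights):
    """Attention-weighted average of snippet features, shape (D,)."""
    w = weights / (weights.sum() + 1e-8)
    return w @ features

def context_triplet_loss(features, attention, margin=1.0):
    """Illustrative triplet margin loss (an assumption, not the paper's loss).

    Anchor: pooled context feature. Positive: pooled background feature.
    Negative: pooled foreground (action) feature. The loss is zero once the
    context is closer to the background than to the action by at least margin.
    """
    fg = weighted_mean(features, attention[:, 0])
    bg = weighted_mean(features, attention[:, 1])
    ctx = weighted_mean(features, attention[:, 2])
    d_bg = np.linalg.norm(ctx - bg)
    d_fg = np.linalg.norm(ctx - fg)
    return float(max(0.0, d_bg - d_fg + margin))
```

Under these assumptions, the loss vanishes when the pooled context feature already sits near the background and far from the action, and grows when the context drifts toward the action feature.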