Weakly-Supervised Temporal Action Localization with Multi-Head Cross-Modal Attention

Hao Ren, Haoran Ren, Wu Ran, Hong Lu, Cheng Jin
DOI: https://doi.org/10.1007/978-3-031-20868-3_21
2022-01-01
Abstract: Weakly-supervised temporal action localization seeks to localize the temporal boundaries of actions and identify their categories at the same time, using only video-level category labels during training. Among existing methods, modal cooperation methods have achieved great success by providing pseudo supervision signals to RGB and Flow features. However, most of these methods ignore the cross-correlation between the features of the two modalities, which can help them learn better features. To exploit this cross-correlation, we propose a novel multi-head cross-modal attention mechanism that models it explicitly. The proposed method collaboratively enhances RGB and Flow features through a cross-correlation matrix. In this way, the enhanced features of each modality encode inter-modal information while preserving the exclusive and meaningful intra-modal characteristics. Experiments with three recent methods demonstrate that the proposed Multi-head Cross-modal Attention (MCA) mechanism significantly improves their performance and even achieves state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.
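The abstract does not give the formulation of MCA, so the following is only a rough sketch of how a multi-head cross-modal attention step driven by a cross-correlation matrix might look. Everything in it is an assumption made for illustration: the function name `cross_modal_attention`, the `(T, D)` snippet-feature layout, the scaled dot-product form of the correlation, the softmax normalization, and the residual connection that keeps the original intra-modal features. Learned projection layers, which a real model would almost certainly include, are omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(rgb, flow, num_heads=4):
    """Hypothetical sketch, not the paper's exact formulation.

    rgb, flow: arrays of shape (T, D), one feature vector per video snippet.
    Returns enhanced RGB and Flow features, each of shape (T, D).
    """
    T, D = rgb.shape
    assert flow.shape == (T, D) and D % num_heads == 0
    d = D // num_heads

    # Split the channel dimension into heads: (H, T, d).
    r = rgb.reshape(T, num_heads, d).transpose(1, 0, 2)
    f = flow.reshape(T, num_heads, d).transpose(1, 0, 2)

    # Per-head cross-correlation between RGB and Flow snippets: (H, T, T).
    # Entry [h, i, j] relates RGB snippet i to Flow snippet j (assumed form).
    corr = r @ f.transpose(0, 2, 1) / np.sqrt(d)

    # RGB snippets attend over Flow snippets; Flow snippets attend over
    # RGB snippets through the transposed correlation matrix.
    rgb_from_flow = softmax(corr, axis=-1) @ f
    flow_from_rgb = softmax(corr.transpose(0, 2, 1), axis=-1) @ r

    def merge(x):
        # Concatenate heads back into (T, D).
        return x.transpose(1, 0, 2).reshape(T, D)

    # Residual connections keep each modality's own features (assumption).
    return rgb + merge(rgb_from_flow), flow + merge(flow_from_rgb)


# Usage with random stand-in features: 8 snippets, 16 channels.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 16))
flow_feat = rng.standard_normal((8, 16))
rgb_enh, flow_enh = cross_modal_attention(rgb_feat, flow_feat, num_heads=4)
```

In this sketch a single correlation matrix per head serves both directions, which is one plausible reading of "collaboratively enhances RGB and Flow features"; the residual sum is one simple way to add inter-modal information without discarding intra-modal characteristics.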