Abstract: Temporal action localization is a challenging computer vision task that aims to find the start and end times of actions and predict their categories. Weakly supervised temporal action localization (WTAL) is more challenging still, because only video-level annotations are available. With such coarse labels, background frames that resemble actions can be misclassified as actions, producing inaccurate localization. In addition, the previously ignored two-stream fusion problem needs further consideration. To address these issues, we propose a novel action saliency and context-aware network (ASCN) for WTAL. Specifically, a temporal saliency and context module enhances the global saliency and context information of the RGB and flow features to suppress backgrounds and enhance actions. A hybrid attention mechanism that uses frame differences and two-stream attention then models local action context, further raising the scores of potential action regions and suppressing background regions. Finally, to enforce two-stream consistency and solve the fusion problem, we use a similarity loss and a channel self-attention module to adaptively fuse the enhanced RGB and flow features. Extensive experiments show that ASCN outperforms all SOTA WTAL methods on the THUMOS14 and ActivityNet1.3 datasets, reaching an average mAP of 37.2% on THUMOS14 and 26.3% on ActivityNet1.3. On the ActivityNet1.2 dataset, ASCN obtains comparable results. Compared with AdapNet (TNNLS20), MMSD (TIP22), and FTCL (CVPR22) on THUMOS14, ASCN improves by 13.5%, 2.9%, and 2.8%, respectively.
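
The final step of the abstract (a similarity loss for two-stream consistency, followed by adaptive fusion of RGB and flow features) can be illustrated with a minimal sketch. The abstract does not specify either component, so everything here is an assumption: the loss is taken to be one minus cosine similarity averaged over snippets, and a simple per-channel softmax gate stands in for the channel self-attention module. The function names `similarity_loss` and `channel_gated_fuse` are hypothetical and are not from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def similarity_loss(rgb, flow):
    """Assumed consistency loss: mean of (1 - cosine similarity) over snippets.

    rgb and flow are lists of T snippets, each a list of C channel values.
    The loss is near 0 when the two streams point in the same direction.
    """
    per_snippet = [1.0 - cosine_similarity(r, f) for r, f in zip(rgb, flow)]
    return sum(per_snippet) / len(per_snippet)


def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_gated_fuse(rgb, flow):
    """Assumed stand-in for channel attention: a per-channel softmax gate.

    Each channel is summarized by its temporal mean in each stream; a softmax
    over the two summaries gives the weights used to mix RGB and flow.
    """
    num_snippets, num_channels = len(rgb), len(rgb[0])
    weights = []
    for c in range(num_channels):
        rgb_mean = sum(rgb[t][c] for t in range(num_snippets)) / num_snippets
        flow_mean = sum(flow[t][c] for t in range(num_snippets)) / num_snippets
        weights.append(softmax([rgb_mean, flow_mean]))
    return [
        [weights[c][0] * rgb[t][c] + weights[c][1] * flow[t][c]
         for c in range(num_channels)]
        for t in range(num_snippets)
    ]
```

In a trained network the gate weights would come from learned layers rather than raw channel means; the sketch only shows the data flow of a consistency term plus channel-wise adaptive mixing.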

Weakly-Supervised Action Localization by Hierarchical Attention Mechanism with Multi-Scale Fusion Strategies

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Weakly-Supervised Temporal Action Localization Based on Attention Regularization

Entropy Guided Attention Network for Weakly-Supervised Action Localization

Weakly-supervised Action Localization Via Hierarchical Mining

Weakly-Supervised Action Localization by Generative Attention Modeling

Weakly-Supervised Temporal Action Localization with Multi-Head Cross-Modal Attention

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization

Structured Attention Composition for Temporal Action Localization

Multi-Dimensional Attention with Similarity Constraint for Weakly-Supervised Temporal Action Localization

SAPS: Self-Attentive Pathway Search for weakly-supervised action localization with background-action augmentation

Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization

Weakly-supervised Temporal Action Localization with Adaptive Clustering and Refining Network

Modeling Sub-Actions for Weakly Supervised Temporal Action Localization

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization

Generalized Uncertainty-Based Evidential Fusion with Hybrid Multi-Head Attention for Weak-Supervised Temporal Action Localization

Weakly supervised temporal action localization with actionness-guided false positive suppression

Weakly-Supervised Temporal Action Localization with Regional Similarity Consistency

Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization

Transferable Knowledge-Based Multi-Granularity Fusion Network for Weakly Supervised Temporal Action Detection