Abstract: Temporal action localization is a challenging computer vision task that aims to find the start and end times of actions and predict their categories. Weakly supervised temporal action localization (WTAL) is more challenging still, because only video-level annotations are available. Under this weak supervision, background frames that resemble actions can be classified as actions, producing inaccurate results. In addition, the two-stream fusion problem, ignored in previous work, needs further consideration. To resolve these issues, we propose a novel action saliency and context-aware network (ASCN) for WTAL. Specifically, a temporal saliency and context module enhances the global saliency and context information of the RGB and flow features, suppressing backgrounds and enhancing actions. A hybrid attention mechanism using frame differences and two-stream attention then models local action context information, further raising the scores of potential action regions and suppressing background regions. Finally, to obtain two-stream consistency and solve the fusion problem, we use a similarity loss and a channel self-attention module to adaptively fuse the enhanced RGB and flow features. Extensive experiments demonstrate that ASCN outperforms all SOTA WTAL methods on the THUMOS14 and ActivityNet1.3 datasets, reaching an average mAP of 37.2% on THUMOS14 and 26.3% on ActivityNet1.3. On the ActivityNet1.2 dataset, ASCN obtains comparable results. On THUMOS14, ASCN outperforms AdapNet (TNNLS20), MMSD (TIP22), and FTCL (CVPR22) by 13.5%, 2.9%, and 2.8%, respectively.
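
The mAP figures above score predicted (start, end) segments against ground-truth segments. Such scores are commonly computed by matching segments on their temporal intersection over union (tIoU) at several thresholds and averaging the resulting mAP values. The abstract does not state the thresholds or the matching protocol, so the following is only a minimal sketch of the tIoU step; the function name `temporal_iou` and the use of seconds as the unit are assumptions for illustration, not part of the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments (hypothetical helper).

    Both segments are assumed to be (start, end) tuples in seconds
    with start <= end.
    """
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction would typically count as correct at a given threshold when its tIoU with an unmatched ground-truth segment of the same category meets or exceeds that threshold.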

COWO: Towards Real-Time Spatiotemporal Action Localization in Videos

Real-time spatiotemporal action localization algorithm using improved CNNs architecture

You watch once more: a more effective CNN architecture for video spatio-temporal action localization

YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

Spatiotemporal Multi-Task Network for Human Activity Understanding

Spatiotemporal Action Recognition in Restaurant Videos

Spatial–Temporal Context-Aware Online Action Detection and Prediction

A fast human action recognition network based on spatio-temporal features

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Multi‐scale feature learning and temporal probing strategy for one‐stage temporal action localization

Spatio-Temporal Action Localization in a Weakly Supervised Setting

Action Recognition and Localization with Instance FCNN

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

A weakly supervised CNN model for spatial localization of human activities in unconstraint environment

YOWO-Plus: An Incremental Improvement

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Learning to Track for Spatio-Temporal Action Localization

A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos