Abstract: Weakly supervised temporal action localization (WTAL) aims to classify actions in a video and localize their temporal boundaries, given only video-level category labels during training. Because boundary information is unavailable during training, existing approaches formulate WTAL as a classification problem, i.e., they generate a temporal class activation map (T-CAM) for localization. However, a model trained with only a classification loss is suboptimal: the action-related scenes alone are enough to distinguish different class labels. We refer to other actions that occur in the action-related scene (i.e., the same scene as the positive actions) as co-scene actions; such a suboptimal model misclassifies co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions. Bi-SCC first adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions across videos. Then, a semantic consistency constraint (SCC) enforces consistency between the predictions of the original video and the augmented video, hence suppressing the co-scene actions. However, we find that the augmented video destroys the original temporal context, so simply applying the consistency constraint would harm the completeness of the localized positive actions. Hence, we boost the SCC in a bidirectional way, cross-supervising the original and augmented videos, to suppress co-scene actions while preserving the integrity of positive actions. Finally, Bi-SCC can be applied to current WTAL approaches and improves their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet. The code is available at https://github.com/lgzlIlIlI/BiSCC.
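
To make the idea of cross-supervising an original and an augmented video concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked in the abstract for that). The function names, the snippet-replacement augmentation, the `ratio` and `seed` parameters, and the use of a symmetric KL divergence are all assumptions chosen for illustration; the abstract does not specify these details.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over one snippet's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def augment_temporal_context(snippets, other_snippets, ratio=0.5, seed=0):
    """Hypothetical augmentation: swap a random subset of snippets for
    snippets drawn from a different video. Returns the augmented sequence
    and a per-snippet flag that is True where the original snippet was kept."""
    rng = random.Random(seed)
    num_replace = int(round(ratio * len(snippets)))
    replaced = set(rng.sample(range(len(snippets)), num_replace))
    augmented, kept = [], []
    for t, snippet in enumerate(snippets):
        if t in replaced:
            augmented.append(rng.choice(other_snippets))
            kept.append(False)
        else:
            augmented.append(snippet)
            kept.append(True)
    return augmented, kept


def kl_divergence(target, pred, eps=1e-8):
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(target, pred))


def bidirectional_consistency(tcam_orig, tcam_aug, kept):
    """Assumed form of a bidirectional consistency loss: on snippets present
    in both videos, each video's class distribution serves as the target for
    the other. In a deep learning framework the target side would normally be
    detached from the gradient; plain Python has no gradients, so that step
    is only noted here."""
    forward, backward, count = 0.0, 0.0, 0
    for scores_o, scores_a, keep in zip(tcam_orig, tcam_aug, kept):
        if not keep:
            continue  # replaced snippets have no counterpart to compare with
        p_o, p_a = softmax(scores_o), softmax(scores_a)
        forward += kl_divergence(p_o, p_a)   # original supervises augmented
        backward += kl_divergence(p_a, p_o)  # augmented supervises original
        count += 1
    if count == 0:
        return 0.0
    return (forward + backward) / count
```

In this sketch the loss is zero when the two T-CAMs agree on every shared snippet and grows as their class distributions diverge. In a real training loop it would be added to the video-level classification loss with a weighting factor, which is likewise an assumption here.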

Snippet-to-Prototype Contrastive Consensus Network for Weakly Supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization through Contrast based Evaluation Networks

Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation

Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization

Weakly-Supervised Temporal Action Localization with Regional Similarity Consistency

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Progressive Enhancement Network with Pseudo Labels for Weakly Supervised Temporal Action Localization

A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization

Adaptive Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

SGLP-Net: Sparse Graph Label Propagation Network for Weakly-Supervised Temporal Action Localization

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization With Bidirectional Semantic Consistency Constraint

Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization

Weakly-supervised Temporal Action Localization with Adaptive Clustering and Refining Network

Consistency Prototype Module and Motion Compensation for Few-Shot Action Recognition (CLIP-CP$\mathbf{M^2}$C)

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization
