Abstract: Weakly supervised temporal action localization aims to locate the temporal boundaries of action instances in untrimmed videos using video-level labels and to assign each instance its action category. It is generally solved with a pipeline called "localization-by-classification", which finds action instances by classifying video snippets. However, because this approach optimizes a video-level classification objective, the generated activation sequences often suffer interference from class-related scenes, resulting in a large number of false positives in the predictions. Many existing works treat background as an independent category, forcing models to learn to distinguish background snippets. Under weakly supervised conditions, however, background information is fuzzy and uncertain, which makes this approach extremely difficult. To alleviate the impact of false positives, we propose a new actionness-guided false positive suppression framework. Our method seeks to suppress false positive backgrounds without introducing a background category. First, we propose a self-training actionness branch to learn class-agnostic actionness, which can minimize the interference of class-related scene information by ignoring the video labels. Second, we propose a false positive suppression module to mine false positive snippets and suppress them. Finally, we introduce a foreground enhancement module, which guides the model to learn the foreground with the help of the attention mechanism and class-agnostic actionness. We conduct extensive experiments on three benchmarks (THUMOS14, ActivityNet1.2, and ActivityNet1.3). The results demonstrate that our method effectively suppresses false positives and achieves state-of-the-art performance. Code: https://github.com/lizhilin-ustc/AFPS.

Self-attention relational modeling and background suppression for weakly supervised temporal action localization

Entropy Guided Attention Network for Weakly-Supervised Action Localization

Weakly supervised temporal action localization with actionness-guided false positive suppression

Weakly-Supervised Temporal Action Localization Based on Attention Regularization

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization

Weakly-Supervised Action Localization by Hierarchical Attention Mechanism with Multi-Scale Fusion Strategies

SAPS: Self-Attentive Pathway Search for weakly-supervised action localization with background-action augmentation

Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature

Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

Forcing the Whole Video As Background: an Adversarial Learning Strategy for Weakly Temporal Action Localization

Multi-Dimensional Attention with Similarity Constraint for Weakly-Supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Double branch synergies with modal reinforcement for weakly supervised temporal action detection

Weakly Supervised Temporal Action Localization Through Contrastive Learning

Exploring Sub-Action Granularity for Weakly Supervised Temporal Action Localization

Diffusion-based framework for weakly-supervised temporal action localization

SODA: Weakly Supervised Temporal Action Localization Based on Astute Background Response and Self-Distillation Learning

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization