Abstract: As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted increasing attention recently. With only video-level category labels, this task must identify background and action categories alike, frame by frame. However, this is non-trivial in untrimmed videos because of the unconstrained background and the complex, multi-label actions. Observing that these difficulties stem mainly from the large variations within background and actions, we propose to address them from the perspective of modeling variations. Moreover, it is desirable to further reduce the variations, that is, to learn compact features, so as to cast the problem of background identification as rejecting background and to alleviate the contradiction between classification and detection. Accordingly, in this paper, we propose a two-branch relational prototypical network. The first branch, namely the action-branch, adopts class-wise prototypes and mainly acts as an auxiliary that introduces prior knowledge about label dependencies and guides the second branch. Meanwhile, the second branch, namely the sub-branch, starts with multiple prototypes, namely sub-prototypes, which give it a strong ability to model variations. As a further benefit, we design a multi-label clustering loss based on the sub-prototypes to learn compact features under the multi-label setting. The two branches are associated through the correspondences between the two types of prototypes, which leads to a special two-stage classifier in the sub-branch. In addition, the two branches serve as regularization terms for each other, improving the final performance. Ablation studies show that the proposed model is capable of modeling classes with large variations and of learning compact features. Extensive experimental evaluations on the THUMOS14, MultiTHUMOS and ActivityNet datasets demonstrate the effectiveness of the proposed method and its superior performance over state-of-the-art approaches.

Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization

Weakly-Supervised Temporal Action Localization with Regional Similarity Consistency

Modeling Sub-Actions for Weakly Supervised Temporal Action Localization

Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation

Snippet-to-Prototype Contrastive Consensus Network for Weakly Supervised Temporal Action Localization

Action Shuffling for Weakly Supervised Temporal Localization

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization With Bidirectional Semantic Consistency Constraint

Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Weakly Supervised Temporal Action Localization Through Contrastive Learning

Towards Train-Test Consistency for Semi-supervised Temporal Action Localization

Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization

Weakly Supervised Temporal Action Localization through Contrast based Evaluation Networks

Weakly-Supervised Temporal Action Localization Based on Attention Regularization

Learning Reliable Dense Pseudo-Labels for Point-Level Weakly-Supervised Action Localization