Abstract: Temporal action localization (TAL), especially in its weakly-supervised and unsupervised forms, has recently become an active research topic. Existing unsupervised methods follow an iterative "clustering and training" strategy and differ mainly in the model design used during the training stage, but they often overlook consistency between the two stages. This consistency is crucial: more accurate clustering reduces the noise in pseudolabels and thus improves model training, while more robust training in turn enriches the feature representation used for clustering. We identify two critical challenges in the unsupervised setting: (1) What features should the model generate for clustering? (2) Which pseudolabeled instances from clustering should be chosen for model training? After extensive exploration, we propose a novel yet simple framework called Consistency-Oriented Progressive high actionness Learning to address these issues. For feature generation, our framework adopts a High Actionness snippet Selection (HAS) module, which generates more discriminative global video features for clustering from the enhanced actionness features produced by a purpose-designed Inner-Outer Consistency Network (IOCNet). For pseudolabel selection, we introduce a Progressive Learning With Representative Instances (PLRI) strategy that identifies the most reliable and informative instances within each cluster for model training. These three modules, HAS, IOCNet, and PLRI, work together to improve consistency in model training and clustering performance. Extensive experiments on the THUMOS'14 and ActivityNet v1.2 datasets, under both unsupervised and weakly-supervised settings, demonstrate that our framework achieves state-of-the-art results.
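
The "clustering and training" loop described in the abstract can be illustrated with a minimal sketch. Nothing below is taken from the paper's implementation; every function name, parameter, and default is hypothetical. It assumes that a video-level feature is the mean of the snippets with the highest actionness scores, that clustering is plain k-means with a simplified deterministic initialization, and that "representative" instances are those closest to their cluster centroid. The actual HAS, IOCNet, and PLRI designs may differ substantially.

```python
import math

def select_high_actionness(features, actionness, top_ratio=0.25):
    """Assumed stand-in for snippet selection: average the snippet
    features whose actionness scores fall in the top `top_ratio`."""
    k = max(1, int(len(features) * top_ratio))
    top = sorted(range(len(features)), key=lambda i: actionness[i], reverse=True)[:k]
    dim = len(features[0])
    return [sum(features[i][d] for i in top) / k for d in range(dim)]

def kmeans(points, num_clusters, iters=20):
    """Plain k-means. Initial centroids are evenly spaced input points,
    a simplification chosen only to keep the sketch deterministic."""
    n = len(points)
    centroids = [list(points[i * n // num_clusters]) for i in range(num_clusters)]
    labels = [0] * n
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(num_clusters), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for c in range(num_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def representative_instances(points, labels, centroids, keep_ratio):
    """Assumed selection rule: per cluster, keep the `keep_ratio` fraction
    of instances nearest the centroid as pseudolabeled training data."""
    selected = []
    for c in range(len(centroids)):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(points[i], centroids[c]))
        if members:
            selected.extend(members[:max(1, math.ceil(len(members) * keep_ratio))])
    return sorted(selected)

def progressive_rounds(video_features, num_clusters, keep_ratios=(0.5, 1.0)):
    """One clustering pass per round, admitting more instances each round.
    A real system would retrain the model and regenerate features here."""
    rounds = []
    for ratio in keep_ratios:
        labels, centroids = kmeans(video_features, num_clusters)
        chosen = representative_instances(video_features, labels, centroids, ratio)
        rounds.append((labels, chosen))
    return rounds
```

In this sketch the growing `keep_ratios` schedule is one plausible reading of "progressive": early rounds train only on instances near the cluster centers, and later rounds admit the rest once the features are presumably more reliable.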

Revisiting Unsupervised Temporal Action Localization: the Primacy of High-Quality Actionness and Pseudolabels

Learning Temporal Co-Attention Models for Unsupervised Video Action Localization

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

Weakly-supervised Temporal Action Localization with Adaptive Clustering and Refining Network

APSL: Action-positive Separation Learning for Unsupervised Temporal Action Localization

Active learning with effective scoring functions for semi-supervised temporal action localization

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Realigning Confidence with Temporal Saliency Information for Point-Level Weakly-Supervised Temporal Action Localization

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach

Bottom-Up Temporal Action Localization with Mutual Regularization

Weakly-Supervised Temporal Action Localization Based on Attention Regularization

Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Modeling Sub-Actions for Weakly Supervised Temporal Action Localization

Action Sensitivity Learning for Temporal Action Localization

Sub-action Prototype Learning for Point-level Weakly-supervised Temporal Action Localization

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Probabilistic Temporal Modeling for Unintentional Action Localization

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization

ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization

Towards Train-Test Consistency for Semi-supervised Temporal Action Localization