Abstract: With the explosive growth of videos, the weakly-supervised temporal action localization (WS-TAL) task has become a promising research direction in pattern analysis and machine learning. WS-TAL aims to detect and localize action instances using only video-level labels during training. Modern approaches have achieved impressive progress via powerful deep neural networks. However, robust and reliable WS-TAL remains challenging and underexplored because of the considerable uncertainty caused by weak supervision, noisy evaluation environments, and unknown categories in the open world. To this end, we propose a new paradigm, named vectorized evidential learning (VEL), to explore local-to-global evidence collection for improving model performance. Specifically, a series of learnable meta-action units (MAUs) is automatically constructed; these units serve as the fundamental elements that constitute diverse action categories. Since the same meta-action unit can manifest as distinct action components within different action categories, we leverage MAUs and category representations to dynamically and adaptively learn action components and action-component relations. After uncertainty estimation is performed at both the category level and the unit level, the local evidence from action components is accumulated and optimized under Subjective Logic theory. Extensive experiments on the regular, noisy, and open-set settings of three popular benchmarks show that VEL consistently obtains more robust and reliable action localization performance than state-of-the-art methods.
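To make the evidential terminology concrete, the following is a minimal sketch of the standard Subjective Logic mapping used in evidential deep learning, in which non-negative evidence for K classes is turned into belief masses and one uncertainty mass that together sum to one. It is not the VEL implementation: the function names `subjective_opinion` and `accumulate_evidence` are hypothetical, and the plain summation of local evidence is an assumption chosen for illustration, since the abstract does not specify the accumulation rule.

```python
from typing import List, Tuple


def subjective_opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative per-class evidence to belief masses and an uncertainty mass.

    Uses the common evidential formulation: alpha_k = e_k + 1,
    S = sum_k alpha_k, b_k = e_k / S, u = K / S.
    """
    if not evidence:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    # Dirichlet strength: total evidence plus one pseudo-count per class.
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def accumulate_evidence(local_evidence: List[List[float]]) -> List[float]:
    """Sum evidence vectors from several local sources.

    Assumption for illustration only: local evidence is combined by
    element-wise addition. The paper's actual accumulation may differ.
    """
    return [sum(column) for column in zip(*local_evidence)]


if __name__ == "__main__":
    # Two hypothetical local sources voting over three categories.
    local = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
    total = accumulate_evidence(local)
    belief, uncertainty = subjective_opinion(total)
    print(total, belief, uncertainty)
```

In this formulation, a sample with no evidence for any class receives an uncertainty mass of 1, which is the property that makes evidential models attractive for noisy and open-set settings like those named in the abstract.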

VAL: Visual-Attention Action Localizer

Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language

DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks

ContextLoc++: A Unified Context Model for Temporal Action Localization

Action Sensitivity Learning for Temporal Action Localization

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Learning Temporal Co-Attention Models for Unsupervised Video Action Localization

Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Vectorized Evidential Learning for Weakly-supervised Temporal Action Localization

Action Coherence Network for Weakly-Supervised Temporal Action Localization

Localizing Unseen Activities in Video Via Image Query

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

Advancing Temporal Action Localization with a Boundary Awareness Network

MA-VLAD: a fine-grained local feature aggregation scheme for action recognition

Camg: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization Via Language

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Open-Vocabulary Temporal Action Localization using Multimodal Guidance