Abstract: Temporal action localization remains a significant challenge in computer vision, and an efficient method for the task is still elusive. The objective is to identify human activities in untrimmed videos, determining which actions occur in each video and when. Using trimmed videos would sidestep the localization problem and improve classification accuracy, but it is impractical for real-world applications because trimming itself requires human intervention. This highlights the importance of temporal localization. Because several successful approaches exist for action recognition in trimmed video, conventional multi-stage methods for untrimmed video commonly employ one network to generate activity proposals, followed by a separate network for classification. These disjoint networks are optimized individually and therefore usually deviate from the global optimum, which leads to less precise candidate action proposals. To address this challenge, we propose a novel end-to-end neural network that uses error estimation for precise action localization and recognition in untrimmed videos. The proposed method localizes and classifies action instances simultaneously, so the corresponding networks are optimized concurrently. To increase the precision of the action proposal boundaries, a Regression module is included in the end-to-end network alongside the Evaluation and Classification modules. The Regression module estimates the potential error in the proposal time boundaries and improves the accuracy of the results. We conducted experiments on THUMOS 14 and ActivityNet-1.3, which are considered the most challenging datasets for temporal action localization. The novel yet fairly simple network achieves a remarkable performance improvement over other state-of-the-art methods. This improvement, which is more pronounced at high temporal intersection with the ground truth, is achieved without extra data or a complicated architecture. Incorporating error estimation improved the mean Average Precision (mAP). The approach performs particularly well on challenging activities in the complex and diverse ActivityNet-1.3 dataset. For instance, for the "drinking coffee" activity, the mAP improved fivefold over the best previously reported result.
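As a rough illustration of the idea of correcting proposal boundaries with an estimated error, the sketch below computes the temporal intersection over union (tIoU) between a proposal and a ground-truth segment, before and after a correction. It is not the network described in the abstract. The names `temporal_iou` and `refine_proposal`, the convention that the estimated error is "predicted boundary minus true boundary" in seconds, and the example numbers are all assumptions made for illustration.

```python
from typing import Tuple

# A temporal segment as (start_time, end_time), in seconds.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def refine_proposal(proposal: Segment, estimated_error: Segment) -> Segment:
    """Correct a proposal by subtracting the estimated boundary errors.

    Assumption (hypothetical convention): each error is
    predicted boundary minus true boundary, so subtracting it
    moves the boundary toward the true position.
    """
    start = proposal[0] - estimated_error[0]
    end = proposal[1] - estimated_error[1]
    # Fall back to the original proposal if the correction is degenerate.
    if end <= start:
        return proposal
    return (start, end)


# Hypothetical example values, not taken from any dataset.
ground_truth: Segment = (10.0, 20.0)
proposal: Segment = (12.0, 19.0)
estimated_error: Segment = (2.0, -1.0)  # start is 2 s late, end is 1 s early

refined = refine_proposal(proposal, estimated_error)
iou_before = temporal_iou(proposal, ground_truth)
iou_after = temporal_iou(refined, ground_truth)
print(f"tIoU before: {iou_before:.2f}, after: {iou_after:.2f}")
```

In this toy case the error estimate is perfect, so the refined segment matches the ground truth exactly. With an imperfect estimate the gain would be smaller, and it matters most when detections are scored at high tIoU thresholds.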

Localizing Unseen Activities in Video Via Image Query

Method for locating unlearned activities in video through image query

Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization

Rethinking the Bottom-Up Framework for Query-Based Video Localization

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

Localizing Activity Groups in Videos

Localizing Events in Videos with Multimodal Queries

Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization

VAL: Visual-Attention Action Localizer

UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Enhancing temporal action localization in an end-to-end network through estimation error incorporation

Unintentional Action Localization Via Counterfactual Examples

Annotation-Efficient Untrimmed Video Action Recognition

Query by Activity Video in the Wild

Video Activity Localisation with Uncertainties in Temporal Boundary

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Relation Attention for Temporal Action Localization

Learning to Localize Actions from Moments

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization

Temporal Textual Localization in Video Via Adversarial Bi-Directional Interaction Networks