Abstract: Temporal action localization remains a significant challenge in computer vision, and an efficient method for the task has yet to be developed. The objective is to identify human activities within untrimmed videos, determining which actions occur in each video and when. While using trimmed videos could potentially resolve the localization problem and enhance classification accuracy, it is impractical for real-world applications because the trimming process itself requires human intervention. This highlights the importance of temporal localization. Because several successful approaches exist for action recognition in trimmed video, conventional multi-stage methods for untrimmed video commonly employ one network to generate activity proposals, followed by a separate network for classification. These disjoint networks are optimized individually and therefore usually deviate from the global optimum, leading to less precise candidate action proposals. To address this challenge, we propose a novel end-to-end neural network that utilizes error estimation for precise action localization and recognition in untrimmed videos. The proposed method performs the localization and classification of action instances simultaneously, thereby optimizing the corresponding networks concurrently. To increase the precision of the action proposal boundaries, a Regression module is incorporated into the proposed end-to-end network alongside the Evaluation and Classification modules. This module estimates the potential error in proposal time boundaries and enhances the result accuracy. We have conducted experiments on THUMOS 14 and ActivityNet-1.3, which are considered the most challenging datasets for temporal action localization. The novel, yet fairly simple, proposed network achieves remarkable performance improvement compared to the other state-of-the-art methods.
This improvement, which is more pronounced at high temporal intersection with the ground truth, is accomplished without requiring extra data or a complicated architecture. By incorporating error estimation, we achieved an improvement in mean Average Precision (mAP). The proposed approach particularly shines in localizing challenging activities in the complex and diverse ActivityNet-1.3 dataset. For instance, for the "drinking coffee" activity, the mAP was enhanced fivefold compared to the best-reported results.
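
To make the idea of boundary error estimation concrete, the following is a minimal sketch, not the network described above. It assumes a hypothetical convention in which the estimated error is the predicted boundary minus the true boundary, so refinement subtracts it; the names `temporal_iou`, `refine`, and the example numbers are illustrative assumptions only. It shows how correcting a proposal's start and end times can raise its temporal intersection over union (tIoU) with the ground truth, which is the quantity thresholded when computing mAP.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); illustrative type alias


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine(proposal: Segment, est_error: Segment) -> Segment:
    """Shift each boundary by its estimated error.

    Assumed convention (hypothetical): error = predicted - true,
    so the corrected boundary is predicted - error.
    """
    return (proposal[0] - est_error[0], proposal[1] - est_error[1])


# Made-up numbers for illustration; not taken from any dataset.
ground_truth = (10.0, 20.0)
proposal = (12.0, 23.0)
estimated_error = (1.5, 2.5)  # stand-in for a regression module's output

refined = refine(proposal, estimated_error)
iou_before = temporal_iou(ground_truth, proposal)
iou_after = temporal_iou(ground_truth, refined)
print(f"tIoU before: {iou_before:.3f}, after: {iou_after:.3f}")
```

In this toy case the refined proposal clears a strict tIoU threshold that the original proposal misses, which is consistent with the abstract's remark that the gain is more pronounced at high temporal intersection with the ground truth.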

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

Interpretable Spatio-temporal Attention for Video Action Recognition

TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

Relation Attention for Temporal Action Localization

Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos.

Enhancing temporal action localization in an end-to-end network through estimation error incorporation

An Active Action Proposal Method Based on Reinforcement Learning

Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization

Structured Attention Composition for Temporal Action Localization

A Temporal-Aware Relation and Attention Network for Temporal Action Localization

Weakly-Supervised Action Localization by Generative Attention Modeling

Weakly-Supervised Temporal Action Localization Based on Attention Regularization

Annotation-Efficient Untrimmed Video Action Recognition

Weakly-supervised action localization via embedding-modeling iterative optimization

Learning Temporal Co-Attention Models for Unsupervised Video Action Localization.

Active Temporal Action Detection in Untrimmed Videos Via Deep Reinforcement Learning

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Alignment-guided Temporal Attention for Video Action Recognition

Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks