Abstract: Temporal action localization remains a significant challenge in computer vision, and an efficient method for the task is still elusive. The objective is to identify human activities in untrimmed videos, determining which actions occur in each video and when. Using trimmed videos could sidestep the localization problem and improve classification accuracy, but it is impractical for real-world applications because trimming itself requires human intervention. This highlights the importance of temporal localization. Because several successful approaches exist for action recognition in trimmed video, conventional multi-stage methods for untrimmed video commonly employ one network to generate activity proposals, followed by a separate network for classification. These disjoint networks are optimized individually and therefore usually deviate from the global optimum, leading to less precise candidate action proposals. To address this challenge, we propose a novel end-to-end neural network that uses error estimation for precise action localization and recognition in untrimmed videos. The proposed method localizes and classifies action instances simultaneously, so the corresponding networks are optimized concurrently. To increase the precision of the action proposal boundaries, a Regression module is incorporated into the end-to-end network alongside the Evaluation and Classification modules. This module estimates the potential error in the proposal time boundaries and thereby improves the accuracy of the results. We conducted experiments on THUMOS 14 and ActivityNet-1.3, which are considered the most challenging datasets for temporal action localization. The proposed network, novel yet fairly simple, achieves a remarkable performance improvement over other state-of-the-art methods. This improvement, which is more pronounced at high temporal intersection with the ground truth, is achieved without extra data or a complicated architecture. Incorporating error estimation improved the mean Average Precision (mAP). The approach performs particularly well when localizing challenging activities in the complex and diverse ActivityNet-1.3 dataset. For instance, for the "drinking coffee" activity, the mAP was five times higher than the best previously reported result.
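
The abstract's central idea, correcting a proposal's time boundaries with an estimated error and judging the result by its temporal overlap with the ground truth, can be illustrated with a small sketch. The abstract does not say how the Regression module parameterizes its error, so the code below assumes a signed offset in seconds for each boundary (predicted minus true); the names `Segment`, `temporal_iou`, and `refine`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A temporal interval in seconds."""
    start: float
    end: float


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def refine(proposal: Segment, start_error: float, end_error: float) -> Segment:
    """Shift each boundary by its estimated error.

    Assumed convention (not from the paper): error = predicted - true,
    so subtracting the estimate moves the boundary toward the truth.
    """
    start = proposal.start - start_error
    end = proposal.end - end_error
    # Keep the segment well-formed if the corrections cross over.
    return Segment(start, max(start, end))


# Illustrative values only: a ground-truth action and a loose proposal.
ground_truth = Segment(10.0, 20.0)
proposal = Segment(12.0, 23.0)

# Hypothetical error estimates, as a regression head might output them.
refined = refine(proposal, start_error=1.5, end_error=2.5)

iou_before = temporal_iou(proposal, ground_truth)
iou_after = temporal_iou(refined, ground_truth)
print(f"tIoU before: {iou_before:.3f}, after: {iou_after:.3f}")
```

Even imperfect error estimates can lift a proposal past a strict overlap threshold, which is consistent with the abstract's remark that the gain is more pronounced at high temporal intersection with the ground truth.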

Confidence-Guided Self Refinement for Action Prediction in Untrimmed Videos

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Exploiting Semantic-Level Affinities with a Mask-Guided Network for Temporal Action Proposal in Videos

Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision

Active Temporal Action Detection in Untrimmed Videos Via Deep Reinforcement Learning

Annotation-Efficient Untrimmed Video Action Recognition

Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

A Joint Model for Action Localization and Classification in Untrimmed Video with Visual Attention

Gated forward refinement network for action segmentation

Spatial–Temporal Context-Aware Online Action Detection and Prediction

A Discussion of Data Sampling Strategies for Early Action Prediction

TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

UntrimmedNets for Weakly Supervised Action Recognition and Detection

Enhancing early action prediction in videos through temporal composition of sub-actions

An Active Action Proposal Method Based on Reinforcement Learning

Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution

Enhancing temporal action localization in an end-to-end network through estimation error incorporation

Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Weakly supervised temporal action localization with actionness-guided false positive suppression