Abstract: Temporal action localization is an active research topic in computer vision and machine learning because of its use in smart surveillance. It is a challenging problem because the actions in untrimmed videos must be classified and their start and end times must be located accurately. Although many temporal action localization methods have been proposed, they require substantial computational resources for training and inference. To address these issues, this work proposes a novel temporal-aware relation and attention network (abbreviated as TRA) for the temporal action localization task. TRA has an anchor-free, end-to-end architecture that fully uses temporal-aware information. Specifically, a temporal self-attention module is first designed to model the relationships between different temporal positions and to give more weight to features within actions. Then, a multiple temporal aggregation module is constructed to aggregate information across the temporal domain. Finally, a graph relation module is designed to obtain aggregated graph features, which are used to refine the boundaries and classification results. Most importantly, these three modules are jointly explored in a unified framework, so temporal awareness is used throughout. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods on the THUMOS14 dataset, reaching an average mAP of 67.6%, and obtains a comparable result on the ActivityNet1.3 dataset, reaching an average mAP of 34.4%. Compared with A2Net (TIP20), PCG-TAL (TIP21), and AFSD (CVPR21), TRA achieves improvements of 11.7%, 4.4%, and 1.8%, respectively, on the THUMOS14 dataset.
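The abstract does not give the internals of the temporal self-attention module, so the following is only a minimal sketch of the general idea it names: relating every temporal position to every other one and weighting features accordingly. It assumes plain scaled dot-product attention with no learned projections; the function names and the toy `clip` features are hypothetical and are not taken from TRA.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(features):
    """Scaled dot-product self-attention along the temporal axis.

    features: list of T vectors (each of length D), one per temporal position.
    Returns (attended, weights), where weights[i][j] is how strongly
    position i attends to position j.
    Assumption: queries, keys, and values are the raw features themselves;
    a real module would normally use learned projections.
    """
    num_steps = len(features)
    dim = len(features[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for query in features:
        # Similarity between this position and every temporal position.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all position features.
        attended.append([sum(row[t] * features[t][d] for t in range(num_steps))
                         for d in range(dim)])
    return attended, weights


# Toy clip: positions 1 and 2 stand in for "inside an action" (hypothetical data).
clip = [[0.1, 0.0], [2.0, 1.5], [2.1, 1.4], [0.0, 0.2]]
attended, weights = temporal_self_attention(clip)
```

In this toy input, the two high-magnitude positions attend mostly to each other, which illustrates how attention weights can concentrate on features that lie within an action.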

Relation Modeling in Spatio-Temporal Action Localization

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Relation Attention for Temporal Action Localization

Video Relation Detection with Spatio-Temporal Graph

Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Three Branches: Detecting Actions With Richer Features

Dual relation network for temporal action localization

A Temporal-Aware Relation and Attention Network for Temporal Action Localization

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Video Visual Relation Detection Via Multi-modal Feature Fusion

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Relational Long Short-Term Memory for Video Action Recognition

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Spatial-Temporal Alignment Network for Action Recognition and Detection

Long Short-Term Relation Networks for Video Action Detection

Temporal Fusion Network for Temporal Action Localization: Submission to ActivityNet Challenge 2020 (Task E)

Progressive Relation Learning for Group Activity Recognition