Abstract:Spatial–temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches address action detection as an object detection problem, which utilizes successful object detection frameworks such as Faster R-CNN to operate action detection at every single frame first, and then generates action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, temporal context information is vital for action detection in videos. Therefore, we propose an online action detection model that leverages the spatial–temporal context information existing in videos to perform action inference and localization. More specifically, we try to depict the spatial–temporal context pattern of actions via an encoder–decoder model that is based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves the information for action detection with the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to accomplish action prediction ahead of time and performs action detection in an online fashion. To evaluate the performance of our method, we conduct experiments on three popular public datasets UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method can achieve competitive or superior performance when compared to the state-of-the-art methods. To encourage further research, we release our project on “https://github.com.hjjpku.OATD.”

Higher-order Network for Action Recognition

Learning Visual Context for Group Activity Recognition.

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Learnable Higher-Order Representation for Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization.

Online Robust Action Recognition Based on a Hierarchical Model

Efficient Spatialtemporal Context Modeling for Action Recognition

Efficient spatiotemporal context modeling for action recognition

Reassessing Hierarchical Representation for Action Recognition in Still Images

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Deep Alternative Neural Network: Exploring Contexts As Early As Possible for Action Recognition

A Novel Hierarchical Framework for Human Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Temporal Attentive Network for Action Recognition

Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Spatial–Temporal Context-Aware Online Action Detection and Prediction