Abstract: Spatial–temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches treat action detection as an object detection problem: they apply successful object detection frameworks such as Faster R-CNN to detect actions in every frame independently, and then generate action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, action detection in videos depends heavily on temporal context. We therefore propose an online action detection model that leverages the spatial–temporal context in videos to perform action inference and localization. More specifically, we model the spatial–temporal context pattern of actions with an encoder–decoder model based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves this information for action detection using the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to predict actions ahead of time and to perform action detection in an online fashion. To evaluate our method, we conduct experiments on three popular public datasets: UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method achieves competitive or superior performance compared with state-of-the-art methods. To encourage further research, we release our project at https://github.com/hjjpku/OATD.
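The abstract does not spell out the incremental action-tube construction algorithm, so the following is only an illustrative sketch of the general idea of building tubes online, one frame at a time, rather than linking boxes offline over the whole video. It assumes a simple greedy rule (extend each tube with the detection that maximizes score plus IoU overlap with the tube's last box); the names `Tube`, `update_tubes`, and the `iou_thresh` parameter are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field

# A box is (x1, y1, x2, y2); a detection is (box, score).


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    """Hypothetical container for one action tube built so far."""
    boxes: list = field(default_factory=list)
    scores: list = field(default_factory=list)

    def mean_score(self):
        return sum(self.scores) / len(self.scores)


def update_tubes(tubes, detections, iou_thresh=0.3):
    """Extend existing tubes with the current frame's detections.

    Assumed greedy rule, not the paper's exact algorithm: tubes are
    visited from highest to lowest mean score, and each takes the
    unused detection that maximizes (score + IoU with its last box),
    provided the IoU reaches iou_thresh. Unmatched detections start
    new tubes. The list is modified in place and returned.
    """
    unused = list(detections)
    for tube in sorted(tubes, key=lambda t: t.mean_score(), reverse=True):
        last_box = tube.boxes[-1]
        candidates = [d for d in unused if iou(last_box, d[0]) >= iou_thresh]
        if not candidates:
            continue  # this tube is not extended in this frame
        best = max(candidates, key=lambda d: d[1] + iou(last_box, d[0]))
        unused.remove(best)
        tube.boxes.append(best[0])
        tube.scores.append(best[1])
    for box, score in unused:
        tubes.append(Tube(boxes=[box], scores=[score]))
    return tubes


# Usage: feed detections frame by frame as they arrive.
tubes = []
update_tubes(tubes, [((0, 0, 10, 10), 0.9)])
update_tubes(tubes, [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)])
```

Because each call only looks at the current frame and the tubes built so far, a tube's running score is available at any time stamp, which is what makes early prediction possible in an online setting.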

Streamer Temporal Action Detection in Live Video by Co-Attention Boundary Matching

Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning.

Enhanced Action Tubelet Detector for Spatio-Temporal Video Action Detection

Three-Stream Action Tubelet Detector for Spatiotemporal Action Detection in Videos.

Domain adaptation with optimized feature distribution for streamer action recognition in live video

Exploiting Attention-Consistency Loss for Spatial-Temporal Stream Action Recognition.

An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos

Action Detection with Two-Stream Enhanced Detector

Cascading Spatio-Temporal Attention Network for Real-Time Action Detection

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Two-Stream Completeness Modeling for Weakly Supervised Temporal Action Detection

Meta-Learning Paradigm and CosAttn for Streamer Action Recognition in Live Video

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Uncertainty-Based Spatial-Temporal Attention for Online Action Detection.

A survey on deep learning-based spatio-temporal action detection

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos.

Towards Practical Compressed Video Action Recognition: A Temporal Enhanced Multi-Stream Network

Joint Network based Attention for Action Recognition

Proposal Complementary Action Detection

Spatial-temporal Interaction Learning Based Two-Stream Network for Action Recognition