Abstract: Spatial–temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches treat action detection as an object detection problem: they apply successful object detection frameworks such as Faster R-CNN to every frame independently, and then generate action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, action detection in videos depends heavily on temporal context. We therefore propose an online action detection model that leverages the spatial–temporal context in videos to perform action inference and localization. More specifically, we model the spatial–temporal context pattern of actions with an encoder–decoder model based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves this information for action detection using the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to predict actions ahead of time and to perform action detection in an online fashion. To evaluate our method, we conduct experiments on three popular public datasets: UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method achieves competitive or superior performance compared with state-of-the-art methods. To encourage further research, we release our project at “https://github.com.hjjpku.OATD.”
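To make the idea of building action tubes incrementally more concrete, the following is a minimal sketch of a generic greedy, IoU-based online linking step. It is not the authors' algorithm, whose details the abstract does not give; the names `Tube`, `iou`, and `update_tubes`, the detection format `(label, box, score)`, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format
Detection = Tuple[str, Box, float]       # (action label, box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    """A sequence of per-frame boxes that share one action label."""
    label: str
    boxes: List[Box] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Running tube confidence: mean of the per-frame scores so far.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def update_tubes(tubes: List[Tube], detections: List[Detection],
                 iou_threshold: float = 0.5) -> List[Tube]:
    """Extend existing tubes with the current frame's detections (in place).

    Each tube, visited from highest to lowest score, greedily takes the
    same-label detection that best overlaps its latest box. Detections
    left unmatched start new tubes. Only past frames are used, so the
    tube set is available after every frame.
    """
    unmatched = list(detections)
    for tube in sorted(tubes, key=lambda t: t.score, reverse=True):
        best, best_iou = None, iou_threshold
        for det in unmatched:
            if det[0] != tube.label:
                continue
            overlap = iou(tube.boxes[-1], det[1])
            if overlap >= best_iou:
                best, best_iou = det, overlap
        if best is not None:
            tube.boxes.append(best[1])
            tube.scores.append(best[2])
            unmatched.remove(best)
    for label, box, score in unmatched:
        tubes.append(Tube(label, [box], [score]))
    return tubes


# Usage: feed detections frame by frame.
tubes: List[Tube] = []
update_tubes(tubes, [("run", (0, 0, 10, 10), 0.9)])
update_tubes(tubes, [("run", (1, 1, 11, 11), 0.8),
                     ("jump", (50, 50, 60, 60), 0.7)])
```

Because each update uses only the frames seen so far, the current tubes and their running scores can be read out at any time, which is what distinguishes an online scheme from linking boxes offline over the whole video. This sketch omits tube termination and any temporal smoothing.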

E^2TAD: an Energy-Efficient Tracking-based Action Detector

Fast-Tracker 2.0: Improving Autonomy of Aerial Tracking with Active Vision and Human Location Regression

A Tracking-Based Two-Stage Framework for Spatio-Temporal Action Detection

TEDdet: Temporal Feature Exchange and Difference Network for Online Real-Time Action Detection

ETAD: Training Action Detection End to End on a Laptop

Fast and Accurate Action Detection in Videos With Motion-Centric Attention Model

A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Actions As Points: a Simple and Efficient Detector for Skeleton-Based Temporal Action Detection

An Empirical Study of End-to-End Temporal Action Detection

Cascading Spatio-Temporal Attention Network for Real-Time Action Detection

Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network

Detecting Action Tubes Via Spatial Action Estimation and Temporal Path Inference

PAMI-AD: An Activity Detector Exploiting Part-attention and Motion Information in Surveillance Videos

Action Recognition Based on Object Tracking and Dense Trajectories

Single Shot Temporal Action Detection

Efficient Video Action Detection with Token Dropout and Context Refinement

ZSTAD: Zero-Shot Temporal Activity Detection

Temporal Action Localization with Enhanced Instant Discriminability

Detecting Human Actions in Surveillance Videos

Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor