Abstract:Spatial–temporal action detection in videos is a challenging problem that has attracted considerable attention in recent years. Most current approaches address action detection as an object detection problem, which utilizes successful object detection frameworks such as Faster R-CNN to operate action detection at every single frame first, and then generates action tubes by linking bounding boxes across the whole video in an offline fashion. However, unlike object detection in static images, temporal context information is vital for action detection in videos. Therefore, we propose an online action detection model that leverages the spatial–temporal context information existing in videos to perform action inference and localization. More specifically, we try to depict the spatial–temporal context pattern of actions via an encoder–decoder model that is based on a convolutional recurrent neural network. The model accepts a video snippet as input and encodes the dynamic information inside the snippet in the forward pass. During the backward pass, the decoder resolves the information for action detection with the current appearance or motion cue at each time stamp. In addition, we devise an incremental action-tube construction algorithm that enables our model to accomplish action prediction ahead of time and performs action detection in an online fashion. To evaluate the performance of our method, we conduct experiments on three popular public datasets UCF-101, UCF-Sports, and J-HMDB-21. The experimental results demonstrate that our method can achieve competitive or superior performance when compared to the state-of-the-art methods. To encourage further research, we release our project on “https://github.com.hjjpku.OATD.”

Real-Time Action Detection Method Based on Multi-Scale Spatiotemporal Feature

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Cascading Spatio-Temporal Attention Network for Real-Time Action Detection

Spatio-Temporal Attention Networks for Action Recognition and Detection

A Spatio-temporal Hybrid Network for Action Recognition

Multi-task CNN Model for Action Detection

Temporal Graph Convolutional Network for Multi-Agent Reinforcement Learning of Action Detection

Multi-Stream Single Shot Spatial-Temporal Action Detection

Multi-scale Dynamic Network for Temporal Action Detection.

Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition

Temporal adaptive feature pyramid network for action detection

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Multi-Branch Spatial-Temporal Network for Action Recognition

Spatiotemporal Multi-Task Network for Human Activity Understanding.

A Human Action Recognition Model Inspired By Multiple Scale Temporal Segments Model Fusion

Action recognition with temporal scale-invariant deep learning framework

Multi-Modality Multi-Task Recurrent Neural Network for Online Action Detection

Local Feature Analysis for real-time Action Recognition.

Spatial-Temporal Neural Networks For Action Recognition

Integrating Temporal and Spatial Attention for Video Action Recognition

An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition.