Abstract: Recent two-stream deep Convolutional Neural Networks (ConvNets) have made significant progress in recognizing human actions in videos. Despite their success, methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences. Further, such methods often build on different baseline two-stream networks. As a result, the differences and distinguishing factors between methods that use Recurrent Neural Networks (RNNs) and those that use convolutional networks on temporally-constructed feature vectors (Temporal-ConvNets) are unclear. In this work, we first demonstrate a strong baseline two-stream ConvNet using ResNet-101. We use this baseline to thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting spatiotemporal information. Building upon our experimental results, we then propose and investigate two different networks to further integrate spatiotemporal information: 1) temporal segment RNN and 2) Inception-style Temporal-ConvNet. We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets applied to spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve overall performance. However, each of these methods requires proper care to achieve state-of-the-art performance; for example, LSTMs require pre-segmented data or else they cannot fully exploit temporal information. Our analysis identifies specific limitations of each method that could form the basis of future work. Our experiments on the UCF101 and HMDB51 datasets achieve state-of-the-art performance, 94.1% and 69.0%, respectively, without requiring extensive temporal augmentation.
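The abstract notes that LSTMs need pre-segmented data to exploit temporal information. As a rough illustration of what temporal segmentation of a feature matrix could look like, the sketch below splits a sequence of per-frame feature vectors into contiguous segments and pools each one, so that a recurrent model would see a short sequence of segment-level features. This is not the paper's implementation: the function name `segment_pool`, the use of mean pooling, and the equal-length split are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def segment_pool(features: List[Vector], num_segments: int) -> List[Vector]:
    """Split a T x D list of per-frame features into num_segments contiguous
    chunks and mean-pool each chunk, returning a num_segments x D list.

    Hypothetical helper: mean pooling and near-equal chunk sizes are
    assumptions, not details taken from the abstract.
    """
    t = len(features)
    if not 1 <= num_segments <= t:
        raise ValueError("num_segments must be between 1 and the number of frames")
    pooled = []
    for s in range(num_segments):
        # Integer boundaries keep chunks contiguous and non-empty.
        start = s * t // num_segments
        end = (s + 1) * t // num_segments
        chunk = features[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Four frames with 2-D features, pooled into two segments of two frames each.
frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
segments = segment_pool(frames, 2)
```

Each row of `segments` would then be one time step of the input to a sequence model such as an LSTM; the choice of pooling operator and segment count is a design decision the abstract does not specify.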

Mutually Reinforced Spatio-Temporal Convolutional Tube for Human Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Multi-scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Spatio-Temporal Attention Networks for Action Recognition and Detection

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

An Enhanced 3dcnn-Convlstm For Spatiotemporal Multimedia Data Analysis

Spatiotemporal Multi-Task Network for Human Activity Understanding

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

DC3D: A Video Action Recognition Network Based on Dense Connection

MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

Spatio-Temporal Collaborative Module for Efficient Action Recognition

Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition