Abstract: Extracting effective spatial-temporal information is important for video-based action recognition. Recently, 3D convolutional neural networks (3D CNNs), which can simultaneously encode spatial and temporal dynamics in videos, have made considerable progress in action recognition. However, almost all existing 3D CNN-based methods recognize human actions using only RGB videos, and relying on a single modality may limit the performance of 3D networks. In this paper, we extend 3D CNNs to depth and pose data in addition to RGB data to evaluate their capacity for spatiotemporal multimodal learning in video action recognition. We propose a novel multimodal two-stream 3D network framework that exploits complementary multimodal information to improve recognition performance. Specifically, we first construct two discriminative video representations, one for the depth modality and one for the pose modality, referred to as the depth residual dynamic image sequence (DRDIS) and the pose estimation map sequence (PEMS), respectively. DRDIS captures the spatial-temporal evolution of actions in depth videos by progressively aggregating local motion information. PEMS eliminates the interference of cluttered backgrounds and intuitively describes the spatial configuration of body parts. The multimodal two-stream 3D CNN processes two separate data streams to learn spatiotemporal features from the DRDIS and PEMS representations. Finally, the classification scores from the two streams are fused for action recognition. We conduct extensive experiments on four challenging action recognition datasets. The experimental results verify the effectiveness and superiority of the proposed method.
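
The final score-fusion step can be illustrated with a minimal sketch. The abstract does not say how the two streams' scores are combined, so the weighted average of softmax scores below is an assumption, and the names `fuse_scores` and `drdis_weight` and the example logits are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(drdis_logits, pems_logits, drdis_weight=0.5):
    """Late fusion of two streams by weighted averaging of class scores.

    Assumption: weighted averaging is one common fusion rule; the paper's
    actual fusion rule is not given in the abstract.
    """
    if len(drdis_logits) != len(pems_logits):
        raise ValueError("both streams must score the same set of classes")
    drdis_scores = softmax(drdis_logits)
    pems_scores = softmax(pems_logits)
    fused = [
        drdis_weight * d + (1.0 - drdis_weight) * p
        for d, p in zip(drdis_scores, pems_scores)
    ]
    # The predicted action is the class with the highest fused score.
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Hypothetical logits for three action classes from each stream.
fused, predicted = fuse_scores([2.0, 1.0, 0.1], [0.5, 3.0, 0.2])
```

With equal weights, a class that one stream scores very confidently can outvote a weaker preference from the other stream, which is the sense in which the two modalities complement each other.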

DC3D: A Video Action Recognition Network Based on Dense Connection

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

A Novel 3D Convolutional Neural Network for Action Recognition in Infrared Videos

A Spatio-temporal Hybrid Network for Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Long-term 3D Convolutional Fusion Network for Action Recognition

Deep Spatiotemporal Relation Learning with 3D Multi-Level Dense Fusion for Video Action Recognition

Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

3D Residual Networks with Channel-Spatial Attention Module for Action Recognition

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network

Efficient Parallel Inflated 3D Convolution Architecture for Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

An Improved Action Recognition Network Based on Appearance and Relation

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Interaction Recognition Using Depth Information Based on 3D CNNs

3D Convolutional Neural Network for Action Recognition

Enhanced Action Recognition With Visual Attribute-Augmented 3D Convolutional Neural Network

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Temporal Residual Feature Learning for Efficient 3D Convolutional Neural Network on Action Recognition Task

3D Convolutional Two-Stream Network for Action Recognition in Videos