Abstract: Extracting effective spatial-temporal information is essential for video-based action recognition. Recently, 3D convolutional neural networks (3D CNNs), which encode spatial and temporal dynamics in videos simultaneously, have made considerable progress in action recognition. However, almost all existing 3D CNN-based methods recognize human actions using RGB videos only, and relying on a single modality may limit the performance of 3D networks. In this paper, we extend 3D CNNs to depth and pose data in addition to RGB data to evaluate their capacity for spatiotemporal multimodal learning in video action recognition. We propose a novel multimodal two-stream 3D network framework that exploits complementary multimodal information to improve recognition performance. Specifically, we first construct two discriminative video representations for the depth and pose modalities, referred to as the depth residual dynamic image sequence (DRDIS) and the pose estimation map sequence (PEMS), respectively. DRDIS captures the spatial-temporal evolution of actions in depth videos by progressively aggregating local motion information. PEMS eliminates the interference of cluttered backgrounds and describes the spatial configuration of body parts intuitively. The multimodal two-stream 3D CNN processes the two data streams separately to learn spatiotemporal features from the DRDIS and PEMS representations. Finally, the classification scores from the two streams are fused for action recognition. We conduct extensive experiments on four challenging action recognition datasets. The experimental results verify the effectiveness and superiority of the proposed method.
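
The final score-fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted average of per-stream softmax scores below is an assumption, and the names `softmax`, `fuse_scores`, and `depth_weight` are hypothetical, introduced only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(depth_logits, pose_logits, depth_weight=0.5):
    """Late fusion of two streams by a weighted average of softmax scores.

    ASSUMPTION: weighted averaging is one common fusion rule; the abstract
    does not say which rule the authors use.
    depth_logits: class scores from the DRDIS (depth) stream.
    pose_logits:  class scores from the PEMS (pose) stream.
    depth_weight: hypothetical weight in [0, 1] given to the depth stream.
    """
    if len(depth_logits) != len(pose_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= depth_weight <= 1.0:
        raise ValueError("depth_weight must lie in [0, 1]")

    depth_probs = softmax(depth_logits)
    pose_probs = softmax(pose_logits)
    fused = [
        depth_weight * d + (1.0 - depth_weight) * p
        for d, p in zip(depth_probs, pose_probs)
    ]
    # The predicted action is the class with the highest fused score.
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Example with three action classes and made-up scores.
fused, predicted = fuse_scores([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
```

With equal weights, a stream that is confident about a class can outvote a less confident stream, which is one way complementary modalities may help; in practice the weight would likely be tuned on validation data.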

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition
