Abstract: Extracting effective spatial-temporal information is essential for video-based action recognition. Recently, 3D convolutional neural networks (3D CNNs), which can simultaneously encode spatial and temporal dynamics in videos, have made considerable progress in action recognition. However, almost all existing 3D CNN-based methods recognize human actions using only RGB videos, and relying on a single modality may limit the performance of 3D networks. In this paper, we extend 3D CNNs to depth and pose data, in addition to RGB data, to evaluate their capacity for spatiotemporal multimodal learning in video action recognition. We propose a novel multimodal two-stream 3D network framework that exploits complementary multimodal information to improve recognition performance. Specifically, we first construct two discriminative video representations for the depth and pose modalities, referred to as the depth residual dynamic image sequence (DRDIS) and the pose estimation map sequence (PEMS), respectively. DRDIS captures the spatial-temporal evolution of actions in depth videos by progressively aggregating local motion information. PEMS eliminates the interference of cluttered backgrounds and intuitively describes the spatial configuration of body parts. The multimodal two-stream 3D CNN processes two separate data streams to learn spatiotemporal features from the DRDIS and PEMS representations. Finally, the classification scores from the two streams are fused for action recognition. We conduct extensive experiments on four challenging action recognition datasets. The experimental results verify the effectiveness and superiority of the proposed method.
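The final step described in the abstract, fusing the classification scores of the two streams, can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted average of softmax probabilities below is an assumption, and the names `fuse_scores`, `predict`, and `depth_weight` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(depth_logits, pose_logits, depth_weight=0.5):
    """Late fusion of two streams by a weighted average of class probabilities.

    Assumption: the fusion rule is a weighted average; the abstract only
    says that the scores of the two streams are fused.
    """
    if len(depth_logits) != len(pose_logits):
        raise ValueError("both streams must score the same set of classes")
    depth_probs = softmax(depth_logits)  # stream fed with DRDIS input
    pose_probs = softmax(pose_logits)    # stream fed with PEMS input
    w = depth_weight
    return [w * d + (1.0 - w) * p for d, p in zip(depth_probs, pose_probs)]


def predict(depth_logits, pose_logits, depth_weight=0.5):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(depth_logits, pose_logits, depth_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, a stream that is confident about one class can outvote a stream that is only mildly in favor of another; changing `depth_weight` shifts that balance toward one modality.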

Learning Spatio-Temporal Features for Action Recognition from the Side of the Video

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Temporal Distinct Representation Learning for Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition

Spatio-Temporal Collaborative Module for Efficient Action Recognition

Human Action Recognition under Log-Euclidean Riemannian Metric

3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Spatio-temporal Laplacian Pyramid Coding for Action Recognition

DB-LSTM: Densely-connected Bi-directional LSTM for Human Action Recognition

Modeling Geometric-Temporal Context with Directional Pyramid Co-Occurrence for Action Recognition

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Temporal-Spatial Mapping for Action Recognition

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Mining Spatial and Spatio-Temporal ROIs for Action Recognition