Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

Juanhui Tu, Mengyuan Liu, Hong Liu
DOI: https://doi.org/10.1109/icme.2018.8486566
2018-01-01
Abstract: It remains a challenge to extract spatial-temporal information from skeleton sequences for 3D human action recognition. Although most recent action recognition methods based on Recurrent Neural Networks (RNN) have achieved outstanding performance, they tend to overemphasize temporal information. Since 3D Convolutional Neural Networks (3D CNN) can simultaneously learn features from both spatial and temporal dimensions by capturing correlations among three-dimensional signals, this paper proposes a novel two-stream model using 3D CNN. To the best of our knowledge, this is the first attempt to use 3D CNN in the field of skeleton-based action recognition. Our method consists of three stages. First, skeleton joints are mapped into a 3D coordinate space to encode the spatial and temporal information. Second, 3D CNN models are separately employed to extract deep features from the spatial and temporal streams. Third, to enhance the ability of discriminative features to capture global relationships, we extend each stream into a multi-temporal version. Extensive experiments on the large-scale NTU RGB-D dataset and the public SmartHome dataset demonstrate that our method outperforms most RNN-based methods, which verifies the complementarity of spatial and temporal information and the method's robustness to noise.
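To illustrate the claim that a 3D convolution captures spatial and temporal correlations at once, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `conv3d_valid`, the `volume[t][y][x]` layout, and the toy "moving right" kernel are all assumptions made for this example; the abstract does not specify the paper's input encoding or network architecture.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as volume[t][y][x] (assumed layout)


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """Naive 3D cross-correlation without padding (stride 1).

    Each output value sums over a window that spans several frames (t)
    and several spatial cells (y, x), so one kernel responds to a
    spatial pattern and its change over time together.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy input: one active cell that moves one step to the right between frames.
moving = [[[1, 0, 0]], [[0, 1, 0]]]
# Same cell, but it stays in place across both frames.
static = [[[1, 0, 0]], [[1, 0, 0]]]
# Hypothetical kernel that matches "cell at x in frame t, cell at x+1 in frame t+1".
right_motion = [[[1, 0]], [[0, 1]]]

print(conv3d_valid(moving, right_motion))  # [[[2.0, 0.0]]]
print(conv3d_valid(static, right_motion))  # [[[1.0, 0.0]]]
```

The kernel responds more strongly to the moving cell than to the static one, which a purely spatial (per-frame) filter of the same width could not distinguish. In practice a framework layer such as a learned 3D convolution would replace this loop; the sketch only shows the arithmetic.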