Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

Feng Yan, Li Ge, Yuan Chunfeng, Wang Chuanxu
DOI: https://doi.org/10.3724/sp.j.1089.2018.17095
2018-01-01
Journal of Computer-Aided Design & Computer Graphics
Abstract: Skeleton-based human action recognition has been widely studied in recent years, driven by advances in depth-capturing devices. However, skeleton data captured from a single camera is view-dependent and noisy. In this paper, we propose a deep network model based on spatio-temporal attention and view attention that reduces the influence of viewpoint variation and noise in skeleton data for human action recognition. The model consists of two sub-networks, both built on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM). The view-specific sub-network incorporates spatio-temporal attention and learns discriminative features from a single input view by paying more attention to key joints and frames. The subsequent view attention sub-network learns common view-invariant representations shared among views and contains a view attention module that selects the discriminative views. Finally, we propose a regularized cross-entropy loss to ensure effective end-to-end training of the network. Experimental results on the NTU action recognition dataset, currently the largest of its kind, demonstrate the effectiveness of the proposed model.
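The attention modules described in the abstract share a common idea: score each item (a joint, a frame, or a view), normalize the scores, and take a weighted combination of the corresponding features. The following is only a minimal illustrative sketch of that generic softmax-weighted pooling in plain Python, not the authors' network. The function names `softmax` and `attention_pool`, the feature values, and the scores are all hypothetical; in the paper the scores would be produced by learned LSTM-based sub-networks, which are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Return the softmax-weighted sum of feature vectors and the weights.

    features: list of equal-length feature vectors (one per view, joint, or frame)
    scores:   list of unnormalized attention scores, one per feature vector
    """
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical example: three camera views, each with a 2-D feature vector.
view_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores: every view contributes equally to the fused representation.
pooled_equal, weights_equal = attention_pool(view_features, [0.0, 0.0, 0.0])

# One dominant score: the fused representation is close to that view's features.
pooled_peaked, weights_peaked = attention_pool(view_features, [10.0, 0.0, 0.0])
```

With equal scores the pooled vector is the plain average of the views; as one score grows, the output moves toward that single view's features, which is the sense in which an attention module "selects" discriminative views.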