A Novel View Attention Network for Skeleton Based Human Action Recognition

Shaochen Li, Zhenyu Liu, Jianrong Tan
DOI: https://doi.org/10.1109/ceect53198.2021.9672614
2021-01-01
Abstract: Skeleton-based human action recognition is increasingly popular thanks to the development of low-cost depth sensors and pose estimation techniques. To enrich the representation of human skeletal characteristics and improve the generalization ability of models, recent approaches use human skeletons observed from multiple viewpoints for feature extraction. Despite significant progress in these multi-view skeleton-based approaches, the intrinsic correlation among views and the form of feature fusion have not been extensively investigated. To tackle these problems, we propose a novel View Attention Network (VANet), which learns the relationships among different views and fuses the multi-view features effectively. First, the spatio-temporal dynamics of human skeletons are encoded in multi-view skeletal arrays. Then, a multi-branch Convolutional Neural Network (CNN) is adopted to extract features from the multiple views. Moreover, we design a view attention module to capture the correlation across different views. In particular, we extend the module to a multi-head format to increase the number of feature spaces and enhance the robustness of the entire network. Finally, an aggregated feature is learned from the module for final recognition. Extensive experiments on the public NTU RGB+D 60 and SBU Kinect Interaction datasets show that our approach achieves state-of-the-art results.
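The abstract does not give the equations of the view attention module, so the following is only a minimal sketch of the general idea it describes: per-view feature vectors (as a multi-branch CNN might produce) attend to one another, several heads work in separate feature subspaces, and the result is pooled into one aggregated feature. Everything here is an assumption made for illustration and not the authors' implementation: the function name `view_attention`, the scaled dot-product scoring, the randomly initialised projection matrices (learned in a real network), the per-head dimension `dim // num_heads`, and the mean pooling over views.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def random_matrix(rows, cols, rng):
    # Hypothetical stand-in for learned projection weights.
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def view_attention(view_feats, num_heads, rng):
    """Fuse V per-view feature vectors of length D into one vector of length D.

    Returns the fused feature and, per head, a V x V matrix of attention
    weights (row i: how much view i attends to each view).
    """
    dim = len(view_feats[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    fused, all_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query/key/value projections (assumed design).
        wq = random_matrix(head_dim, dim, rng)
        wk = random_matrix(head_dim, dim, rng)
        wv = random_matrix(head_dim, dim, rng)
        q = [matvec(wq, f) for f in view_feats]
        k = [matvec(wk, f) for f in view_feats]
        v = [matvec(wv, f) for f in view_feats]
        head_weights, attended = [], []
        for qi in q:
            # Scaled dot-product score of this view against every view.
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(head_dim)
                      for kj in k]
            w = softmax(scores)
            head_weights.append(w)
            attended.append([sum(w[j] * v[j][d] for j in range(len(v)))
                             for d in range(head_dim)])
        # Mean-pool the attended views, then concatenate across heads.
        fused.extend(sum(a[d] for a in attended) / len(attended)
                     for d in range(head_dim))
        all_weights.append(head_weights)
    return fused, all_weights


# Toy usage: 3 views, 8-dimensional features, 2 heads (all values made up).
rng = random.Random(0)
views = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
fused, weights = view_attention(views, num_heads=2, rng=rng)
```

In this sketch each head yields its own view-to-view weight matrix, which is one plausible reading of how a multi-head format could add feature spaces; the paper's actual fusion may differ.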