A Cross View Learning Approach for Skeleton-Based Action Recognition

Hui Zheng, Xinming Zhang
DOI: https://doi.org/10.1109/tcsvt.2021.3100128
IF: 5.859
2022-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: With the prevalence of accessible multi-modal sensors and the maturity of pose estimation algorithms, skeleton-based action recognition has gradually become a mainstream approach to human action recognition (HAR). The key issue is to mine the correlations and dependencies between different joints and bones. In this paper, we propose a cross view learning approach. First, the static and dynamic representations of skeletons are computed and aggregated for each of two views (joints and bones). Then, the integrated representations of these two views are used as parallel inputs to the cross view learning model, which consists mainly of two blocks: a multi-scale learning block and a multi-view fusion block. The former extracts discriminative and comprehensive features within each view, and the latter captures complementary representations across the two views. Finally, the fused representations are fed to the classifier for action recognition. Experiments show that the proposed approach outperforms several state-of-the-art baseline methods and achieves competitive performance.
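The first step of the pipeline (building static and dynamic representations for the joint view and the bone view) can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption following common practice in skeleton-based recognition: a bone is taken as a joint's position minus its parent's position, the dynamic representation is the frame-to-frame difference, and aggregation is plain concatenation. The toy skeleton `PARENTS` and all function names are hypothetical, and the learning and fusion blocks are not modeled.

```python
# Hypothetical 5-joint toy skeleton: PARENTS[i] is the parent of joint i.
# Joint 0 is the root and is its own parent, so its bone vector is zero.
PARENTS = [0, 0, 1, 2, 3]


def bone_view(frames, parents=PARENTS):
    """Assumed bone definition: joint position minus parent position."""
    return [
        [tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
         for j in range(len(frame))]
        for frame in frames
    ]


def dynamic(frames):
    """Assumed dynamics: next frame minus current frame, zero-padded at the end."""
    out = []
    for t, frame in enumerate(frames):
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            out.append([tuple(b - a for a, b in zip(frame[j], nxt[j]))
                        for j in range(len(frame))])
        else:
            out.append([tuple(0.0 for _ in joint) for joint in frame])
    return out


def aggregate(static, motion):
    """Assumed aggregation: concatenate static and dynamic features per joint."""
    return [
        [tuple(s) + tuple(m) for s, m in zip(static_frame, motion_frame)]
        for static_frame, motion_frame in zip(static, motion)
    ]


def two_view_inputs(joints):
    """Return the (joint-view, bone-view) inputs for a sequence of frames."""
    bones = bone_view(joints)
    joint_input = aggregate(joints, dynamic(joints))
    bone_input = aggregate(bones, dynamic(bones))
    return joint_input, bone_input


if __name__ == "__main__":
    # Two frames of (x, y, z) joints; the second frame is shifted by +1 in x.
    frame0 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0),
              (1.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
    frame1 = [(x + 1.0, y, z) for x, y, z in frame0]
    joint_input, bone_input = two_view_inputs([frame0, frame1])
    print(joint_input[0][1])  # static position followed by motion
    print(bone_input[0][1])   # bone vector followed by its motion
```

Under these assumptions, a pure translation of the body changes the joint-view motion features but leaves the bone view unchanged, which hints at why the two views can carry complementary information.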