Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition

Chhavi Dhiman,Dinesh Kumar Vishwakarma,Paras Agarwal
DOI: https://doi.org/10.1145/3441628
2021-07-22
Abstract:Recently, human activity recognition using skeleton data is increasing due to its ease of acquisition and finer shape details. Still, it suffers from a wide range of intra-class variation, inter-class similarity among the actions and view variation due to which extraction of discriminative spatial and temporal features is still a challenging problem. In this regard, we present a novel Residual Inception Attention Driven CNN (RIAC-Net) Network, which visualizes the dynamics of the action in a part-wise manner. The complete skeletonis partitioned into five key parts: Head to Spine, Left Leg, Right Leg, Left Hand, Right Hand. For each part, a Compact Action Skeleton Sequence (CASS) is defined. Part-wise skeleton-based motion dynamics highlights discriminative local features of the skeleton that helps to overcome the challenges of inter-class similarity and intra-class variation with improved recognition performance. The RIAC-Net architecture is inspired by the concept of inception-residual representation that unifies the Attention Driven Residues (ADR) with inception-based Spatio-Temporal Convolution Features (STCF) to learn efficient salient action features. An ablation study is also carried out to analyze the effect of ADR over simple residue-based action representation. The robustness of the proposed framework is evaluated by performing an extensive experiment on four challenging datasets: UT Kinect Action 3D, Florence 3D action, MSR Daily Action3D, and NTU RGB-D datasets, which consistently demonstrate the superiority of the proposed method over other state-of-the-art methods.
computer science, information systems, theory & methods, software engineering
What problem does this paper attempt to address?