Enhanced discriminative graph convolutional network with adaptive temporal modelling for skeleton-based action recognition

Tamam Alsarhan, Usman Ali, Hongtao Lu
DOI: https://doi.org/10.1016/j.cviu.2021.103348
IF: 4.886
2022-01-01
Computer Vision and Image Understanding
Abstract: Graph convolutional networks (GCNs) have achieved promising results in skeleton-based action recognition because they can analyse irregular grids with non-Euclidean geometry. Since skeleton-based action recognition is a classification problem, a suitable graph representation is key to the performance of GCN-based approaches. Despite the practical success of these approaches over the past few years, learning better representations remains challenging, and existing approaches fail to distinguish similar actions. Moreover, most existing GCN-based frameworks focus on modelling spatial information and use a single fixed kernel to model temporal information. Such modelling pays too little attention to diversifying representations across skeleton frames, which makes it inefficient at obtaining discriminative temporal features for different actions and ill-suited to the diversity of human movements. Our main concern in this work is the adaptive extraction of highly discriminative features in both the spatial and the temporal dimensions. To this end, we propose a novel Enhanced Discriminative Graph Convolutional Network (ED-GCN), based on the attention mechanism, for skeleton-based action recognition. Discriminative channel-wise features are obtained by fusing a Squeeze-and-Excitation (SE) module into the GCN to selectively enhance significant features and suppress non-significant ones. The adaptively enhanced feature map is then fused into the graph convolutional layer to improve its capability of learning better representations. For the temporal dimension, inspired by temporal modelling in videos, we introduce an adaptive temporal modelling block (ATB) that flexibly captures temporal structure for skeleton-based action recognition. The proposed ATB is a two-stage module comprising a re-calibration stage and a motion-interaction stage, designed to learn temporal features by integrating the modelling of channel correlation and temporal evolution, respectively. Experimental results on two large-scale datasets, NTU-RGB+D and Kinetics-skeleton, demonstrate the importance of discriminatively learned information and the effectiveness of the proposed ED-GCN for skeleton-based action recognition.
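
The channel re-weighting idea behind the Squeeze-and-Excitation step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name squeeze_excite, the (C, T, V) feature layout (channels, frames, joints), the reduction ratio r, and the random weights are all assumptions chosen for illustration, and the graph convolution itself is omitted.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Illustrative SE gating (hypothetical helper, not from the paper).

    x  : (C, T, V) feature map (channels, frames, joints)
    w1 : (C // r, C) bottleneck weights
    w2 : (C, C // r) expansion weights
    """
    s = x.mean(axis=(1, 2))                # squeeze: one descriptor per channel
    h = np.maximum(w1 @ s, 0.0)            # excitation, step 1: ReLU bottleneck
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # excitation, step 2: sigmoid gates
    return x * g[:, None, None], g         # rescale each channel by its gate

rng = np.random.default_rng(0)
C, T, V, r = 8, 16, 25, 4                  # assumed sizes; 25 joints as in NTU-RGB+D
x = rng.standard_normal((C, T, V))
w1 = rng.standard_normal((C // r, C)) * 0.1   # random stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y, gates = squeeze_excite(x, w1, w2)
```

Each gate lies strictly between 0 and 1, so channels with gates near 1 are kept almost unchanged while channels with gates near 0 are suppressed; in a trained network the weights would be learned jointly with the graph convolution rather than drawn at random.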