Graph2Net: Perceptually-Enriched Graph Learning for Skeleton-Based Action Recognition

Cong Wu, Xiao-Jun Wu, Josef Kittler
DOI: https://doi.org/10.1109/tcsvt.2021.3085959
IF: 5.859
2022-04-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Skeleton representation has attracted a great deal of attention recently as an extremely robust feature for human action recognition. However, its non-Euclidean structural characteristics raise new challenges for conventional solutions. Recent studies have shown that there is a native superiority in modeling spatiotemporal skeleton information with a Graph Convolutional Network (GCN). Nevertheless, the skeleton graph modeling normally focuses on the physical adjacency of the elements of the human skeleton sequence, which contrasts with the requirement to provide a perceptually meaningful representation. To address this problem, in this paper, we propose a perceptually-enriched graph learning method by introducing innovative features to spatial and temporal skeleton graph modeling. For the spatial information modeling, we incorporate a Local-Global Graph Convolutional Network (LG-GCN) that builds a multifaceted spatial perceptual representation. This helps to overcome the limitations caused by over-reliance on the spatial adjacency relationships in the skeleton. For temporal modeling, we present a Region-Aware Graph Convolutional Network (RA-GCN), which directly embeds the regional relationships conveyed by a skeleton sequence into a temporal graph model. This innovation mitigates the deficiency of the original skeleton graph models. In addition, we strengthened the ability of the proposed channel modeling methods to extract multi-scale representations. These innovations result in a lightweight graph convolutional model, referred to as Graph2Net, that simultaneously extends the spatial and temporal perceptual fields, and thus enhances the capacity of the graph model to represent skeleton sequences. We conduct extensive experiments on NTU-RGB+D 60&120, Northwestern-UCLA, and Kinetics-400 datasets to show that our results surpass the performance of several mainstream methods while limiting the model complexity and computational overhead.
engineering, electrical & electronic
What problem does this paper attempt to address?
The problem this paper attempts to address is: **How to improve the perceptual capability and performance of skeleton-based action recognition models**. Specifically, existing methods based on Graph Convolutional Networks (GCN) typically rely on simple physical adjacency relationships when processing skeleton data, which limits the model's perceptual range and information acquisition capability. The paper points out that these methods mainly focus on local adjacency relationships in spatial graph modeling and only consider the relationships between the same joints in temporal graph modeling, neglecting the associative information between different regions. Therefore, these methods are insufficient in capturing meaningful information.

To address these issues, the paper proposes a Perceptually-Enriched Graph Learning method, which expands the model's perceptual range by introducing innovative spatial and temporal graph modeling techniques. The specific contributions include:

1. **Spatial Graph Modeling**:
   - **Local-Global Graph Convolutional Network (LG-GCN)**: Combines local and global relationship modeling to overcome the limitations of over-reliance on spatial adjacency relationships.
   - **Multi-channel Feature Representation**: Enhances the model's perceptual capability through multi-channel segmentation and fusion.
2. **Temporal Graph Modeling**:
   - **Region-Aware Graph Convolutional Network (RA-GCN)**: Considers not only the temporal relationships of the same joints but also extends to regional relationship modeling, improving temporal modeling capability.
3. **Lightweight Model Design**:
   - Constructs a lightweight graph convolutional model (Graph2Net) through multi-channel feature representation and efficient information fusion, enhancing performance while maintaining low model complexity and computational overhead.
The paper validates the proposed method through experiments on the NTU-RGB+D 60 and 120, Northwestern-UCLA, and Kinetics-400 datasets, reporting results that surpass several mainstream methods while limiting model complexity and computational overhead.
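
This summary does not give the layer equations, so the following is only a rough NumPy sketch of the two ideas named in the contributions, not the authors' implementation. It rests on several assumptions: the "global" spatial branch is approximated by softmax similarity between all joints, the "regional" temporal branch by pooling each joint with its physical neighbours across three consecutive frames, and the function names (`local_global_graph_conv`, `region_aware_temporal_agg`) and the 5-joint chain skeleton are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops, then row-normalize so each row sums to 1."""
    adj = adj + np.eye(adj.shape[0])
    return adj / adj.sum(axis=1, keepdims=True)


def local_global_graph_conv(x, bone_adj, w_local, w_global):
    """Spatial sketch for one frame; x has shape (V, C_in).

    Local branch: aggregate over physically adjacent joints.
    Global branch (assumption): aggregate over all joints, weighted by
    a softmax of scaled dot-product similarity between joint features.
    """
    local = normalize_adjacency(bone_adj) @ x @ w_local

    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    global_branch = attn @ x @ w_global

    return local + global_branch


def region_aware_temporal_agg(seq, bone_adj):
    """Temporal sketch; seq has shape (T, V, C).

    Each joint is first pooled with its region (itself plus physical
    neighbours), then averaged over frames t-1, t, t+1, so a joint at
    time t also sees neighbouring joints in adjacent frames.
    """
    region = normalize_adjacency(bone_adj)               # (V, V)
    regional = np.einsum("uv,tvc->tuc", region, seq)     # (T, V, C)
    # Repeat the first and last frame so the output keeps T frames.
    padded = np.pad(regional, ((1, 1), (0, 0), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


# Toy example: a 5-joint chain skeleton (hypothetical layout).
rng = np.random.default_rng(0)
bone_adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    bone_adj[i, j] = bone_adj[j, i] = 1.0

frame = rng.normal(size=(5, 3))
out = local_global_graph_conv(
    frame, bone_adj, rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
)
print(out.shape)

clip = rng.normal(size=(6, 5, 3))
print(region_aware_temporal_agg(clip, bone_adj).shape)
```

In this sketch the local branch can only mix information between bone-connected joints, while the global branch lets distant joints (for example, the two ends of the chain) influence each other in a single layer. That is the kind of enlarged perceptual field the summary describes.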