Abstract: Skeleton representation has attracted a great deal of attention recently as an extremely robust feature for human action recognition. However, its non-Euclidean structural characteristics raise new challenges for conventional solutions. Recent studies have shown that there is a native superiority in modeling spatiotemporal skeleton information with a Graph Convolutional Network (GCN). Nevertheless, the skeleton graph modeling normally focuses on the physical adjacency of the elements of the human skeleton sequence, which contrasts with the requirement to provide a perceptually meaningful representation. To address this problem, in this paper, we propose a perceptually-enriched graph learning method by introducing innovative features to spatial and temporal skeleton graph modeling. For the spatial information modeling, we incorporate a Local-Global Graph Convolutional Network (LG-GCN) that builds a multifaceted spatial perceptual representation. This helps to overcome the limitations caused by over-reliance on the spatial adjacency relationships in the skeleton. For temporal modeling, we present a Region-Aware Graph Convolutional Network (RA-GCN), which directly embeds the regional relationships conveyed by a skeleton sequence into a temporal graph model. This innovation mitigates the deficiency of the original skeleton graph models. In addition, we strengthened the ability of the proposed channel modeling methods to extract multi-scale representations. These innovations result in a lightweight graph convolutional model, referred to as Graph2Net, that simultaneously extends the spatial and temporal perceptual fields, and thus enhances the capacity of the graph model to represent skeleton sequences. We conduct extensive experiments on NTU-RGB+D 60&120, Northwestern-UCLA, and Kinetics-400 datasets to show that our results surpass the performance of several mainstream methods while limiting the model complexity and computational overhead.

What problem does this paper attempt to address?

The problem this paper attempts to address is **how to improve the perceptual capability and performance of skeleton-based action recognition models**. Specifically, existing methods based on Graph Convolutional Networks (GCN) typically rely on simple physical adjacency relationships when processing skeleton data, which limits the model's perceptual range and its ability to acquire information. The paper points out that these methods mainly focus on local adjacency relationships in spatial graph modeling and consider only the relationships between the same joints in temporal graph modeling, neglecting the associative information between different regions. As a result, they capture too little meaningful information.

To address these issues, the paper proposes a Perceptually-Enriched Graph Learning method, which expands the model's perceptual range by introducing new spatial and temporal graph modeling techniques. The specific contributions include:

1. **Spatial Graph Modeling**:
   - **Local-Global Graph Convolutional Network (LG-GCN)**: Combines local and global relationship modeling to overcome the limitations of over-reliance on spatial adjacency relationships.
   - **Multi-channel Feature Representation**: Enhances the model's perceptual capability through multi-channel segmentation and fusion.
2. **Temporal Graph Modeling**:
   - **Region-Aware Graph Convolutional Network (RA-GCN)**: Considers not only the temporal relationships of the same joints but also extends to regional relationship modeling, improving temporal modeling capability.
3. **Lightweight Model Design**:
   - Constructs a lightweight graph convolutional model (Graph2Net) through multi-channel feature representation and efficient information fusion, enhancing performance while maintaining low model complexity and computational overhead.

The paper validates the effectiveness of the proposed method through experiments on multiple datasets, showing that its performance surpasses existing mainstream methods.
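
To make the two ideas above more concrete, here is a minimal NumPy sketch. It is not the paper's LG-GCN or RA-GCN implementation, whose details are not given here. It assumes that a "local" branch aggregates over the normalized physical adjacency, a "global" branch aggregates over a feature-similarity affinity between all joints, and a "region-aware" temporal step averages features over joints in the same body region across a 3-frame window. All function names, the toy 5-joint skeleton, and the region split are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization of A + I: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def global_affinity(x):
    """Assumed global graph: row-wise softmax of joint feature similarity."""
    scores = x @ x.T                                   # (V, V)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def local_global_layer(x, adj, w_local, w_global):
    """Sum a physical-adjacency branch and an all-joints branch, then ReLU."""
    local = normalize_adjacency(adj) @ x @ w_local
    glob = global_affinity(x) @ x @ w_global
    return np.maximum(local + glob, 0.0)

def region_temporal_layer(seq, regions, w):
    """seq: (T, V, C). Average over same-region joints and a 3-frame window.

    Assumes every joint belongs to exactly one region.
    """
    num_joints = seq.shape[1]
    mask = np.zeros((num_joints, num_joints))
    for joints in regions:
        for i in joints:
            for j in joints:
                mask[i, j] = 1.0
    mask /= mask.sum(axis=1, keepdims=True)            # mean over the region
    padded = np.pad(seq, ((1, 1), (0, 0), (0, 0)), mode="edge")
    window = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0  # (T, V, C)
    return np.maximum(mask @ window @ w, 0.0)

# Toy skeleton: a 5-joint chain with two hypothetical regions.
num_joints, channels, out_channels, frames = 5, 3, 8, 4
adj = np.zeros((num_joints, num_joints))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0
regions = [[0, 1, 2], [3, 4]]

rng = np.random.default_rng(0)
x = rng.standard_normal((num_joints, channels))
seq = rng.standard_normal((frames, num_joints, channels))
w_local = rng.standard_normal((channels, out_channels))
w_global = rng.standard_normal((channels, out_channels))
w_temporal = rng.standard_normal((channels, out_channels))

spatial_out = local_global_layer(x, adj, w_local, w_global)
temporal_out = region_temporal_layer(seq, regions, w_temporal)
print(spatial_out.shape, temporal_out.shape)
```

In this sketch the global branch lets joints that are far apart in the skeleton (for example, two hands) exchange information in a single layer, which the physical adjacency alone cannot do. The region mask plays a similar role along the time axis, where a plain temporal convolution would only connect a joint to itself in neighboring frames.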

Graph2Net: Perceptually-Enriched Graph Learning for Skeleton-Based Action Recognition

Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition

Pose-Guided Graph Convolutional Networks for Skeleton-Based Action Recognition

Multi-Stage Attention-Enhanced Sparse Graph Convolutional Network for Skeleton-Based Action Recognition

Optimized Skeleton-based Action Recognition via Sparsified Graph Regression

Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition

Generalized Graph Convolutional Networks for Skeleton-based Action Recognition

Adaptive Attention Memory Graph Convolutional Networks for Skeleton-Based Action Recognition

TSGCNeXt: Dynamic-Static Multi-Graph Convolution for Efficient Skeleton-Based Action Recognition with Long-term Learning Potential

DeGCN: Deformable Graph Convolutional Networks for Skeleton-Based Action Recognition

Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks

Graph Instinctive Attention Convolutional Network for Skeleton-Based Action Recognition

Skeleton action recognition via graph convolutional network with self-attention module

SelfGCN: Graph Convolution Network With Self-Attention for Skeleton-Based Action Recognition

Feedback Graph Convolutional Network for Skeleton-Based Action Recognition

MFGCN: an efficient graph convolutional network based on multi-order feature information for human skeleton action recognition

Hypergraph Neural Network for Skeleton-Based Action Recognition

Richly Activated Graph Convolutional Network for Robust Skeleton-Based Action Recognition

Feature reconstruction graph convolutional network for skeleton-based action recognition

Priori separation graph convolution with long-short term temporal modeling for skeleton-based action recognition