Abstract: Skeleton-based action recognition has attracted research attention in recent years. One common drawback of currently popular skeleton-based human action recognition methods is that sparse skeleton information alone is not sufficient to fully characterize human motion. This limitation makes several existing methods incapable of correctly classifying action categories that exhibit only subtle motion differences. In this paper, we propose a novel framework that employs the human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network, namely JOLO-GCN. Specifically, we use Joint-aligned optical Flow Patches (JFP) to capture the local subtle motion around each joint as the pivotal joint-centered visual information. Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance while keeping the computational and memory overheads low. Experiments on the NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton datasets demonstrate clear accuracy improvements attained by the proposed method over state-of-the-art skeleton-based methods.

What problem does this paper attempt to address?

The paper addresses a limitation of skeleton-based human action recognition: sparse skeleton information alone is not sufficient to fully describe human motion, so some existing methods cannot correctly classify action categories that differ only in subtle motions. Specifically, the paper points out two issues:

1. **Local subtle motion is difficult to capture**: For action categories defined mainly by local subtle motion (such as "shaking one's head"), the difference between skeletons extracted from two consecutive frames is very small and is almost useless for describing the action. In addition, when the body movement of an action is weak, the subtle local motion is easily masked by noisy pose estimation. For actions such as "reading" and "writing", for example, skeleton-based single-modality methods appear to have reached a performance bottleneck.
2. **The skeleton representation is not distinctive enough**: Some action categories cannot be reliably distinguished from the skeleton alone. For example, the skeleton sequences of "pointing at something" and "taking a selfie" are very similar, so skeleton-based single-modality methods are prone to confusing them.

To overcome these limitations, the paper proposes a new framework that augments the human pose skeleton with lightweight joint-centered visual information, namely Joint-aligned optical Flow Patches (JFP). JFP captures the local subtle motion around each joint and serves as the key joint-centered visual cue, which improves performance without significantly increasing the computational and memory overhead. Experimental results show that the method improves recognition accuracy on multiple benchmark datasets.
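The idea of joint-aligned flow patches can be illustrated with a minimal sketch: given a precomputed optical flow field for one frame and the 2D joint positions, crop a small window of flow around each joint. This is not the authors' implementation. The function name `extract_joint_flow_patches`, the `patch_size` value, the zero padding at image borders, and the array layouts are all assumptions made for illustration; the summary above does not specify them.

```python
import numpy as np


def extract_joint_flow_patches(flow, joints, patch_size=8):
    """Crop a patch_size x patch_size window of optical flow around each joint.

    flow:   (H, W, 2) array of per-pixel (dx, dy) flow, assumed precomputed.
    joints: (J, 2) array of (x, y) pixel coordinates for the same frame.
    Returns an array of shape (J, patch_size, patch_size, 2).
    All names and shapes here are hypothetical, chosen for illustration.
    """
    flow = np.asarray(flow)
    joints = np.asarray(joints, dtype=float)
    half = patch_size // 2
    h, w = flow.shape[:2]

    # Zero-pad the flow so windows around joints near the border stay in bounds.
    padded = np.pad(flow, ((half, half), (half, half), (0, 0)), mode="constant")

    patches = np.empty((len(joints), patch_size, patch_size, 2), dtype=flow.dtype)
    for j, (x, y) in enumerate(joints):
        # Clamp noisy joint estimates to the image, then snap to a pixel.
        cx = int(np.clip(round(x), 0, w - 1))
        cy = int(np.clip(round(y), 0, h - 1))
        # Pixel (cy, cx) sits at (cy + half, cx + half) in the padded array,
        # so this slice places the joint at index [half, half] of the patch.
        patches[j] = padded[cy:cy + patch_size, cx:cx + patch_size]
    return patches


if __name__ == "__main__":
    # Toy frame: zero flow everywhere except one moving pixel at the first joint.
    flow = np.zeros((20, 30, 2), dtype=np.float32)
    flow[10, 15] = (1.0, -2.0)
    joints = np.array([[15.0, 10.0], [0.0, 0.0], [29.4, 19.2]])
    patches = extract_joint_flow_patches(flow, joints, patch_size=8)
    print(patches.shape)
    print(patches[0, 4, 4])
```

In a two-stream setup of the kind the abstract describes, per-joint patches like these would feed one stream while the joint coordinates feed the other. How the two streams are combined is not stated in this summary, so it is left out of the sketch.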

JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition

Richly Activated Graph Convolutional Network for Robust Skeleton-Based Action Recognition

Pose-Guided Graph Convolutional Networks for Skeleton-Based Action Recognition

Optimized Skeleton-based Action Recognition via Sparsified Graph Regression

Temporal Enhanced Multi-Stream Graph Convolutional Neural Networks For Skeleton-Based Action Recognition

Combining channel-wise joint attention and temporal attention in graph convolutional networks for skeleton-based action recognition

An improved spatial temporal graph convolutional network for robust skeleton-based action recognition

Multi-Stage Attention-Enhanced Sparse Graph Convolutional Network for Skeleton-Based Action Recognition

Skeleton action recognition via graph convolutional network with self-attention module

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Lightweight Multi-Scale Spatiotemporal Graph Convolutional Network for Skeleton-Based Action Recognition

MFGCN: an efficient graph convolutional network based on multi-order feature information for human skeleton action recognition

Generalized Graph Convolutional Networks for Skeleton-based Action Recognition

Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition

A Multi-Stream Graph Convolutional Networks-Hidden Conditional Random Field Model for Skeleton-Based Action Recognition

Convolutional Relation Network for Skeleton-Based Action Recognition

Skeleton-Based Action Recognition With Low-Level Features of Adaptive Graph Convolutional Networks

Lighter and faster: A multi-scale adaptive graph convolutional network for skeleton-based action recognition

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Multi-Scale Adaptive Aggregate Graph Convolutional Network for Skeleton-Based Action Recognition

Multi-stream P&U adaptive graph convolutional networks for skeleton-based action recognition