JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition

Jinmiao Cai, Nianjuan Jiang, Xiaoguang Han, Kui Jia, Jiangbo Lu
DOI: https://doi.org/10.48550/arXiv.2011.07787
2020-11-16
Abstract: Skeleton-based action recognition has attracted research attention in recent years. One common drawback in currently popular skeleton-based human action recognition methods is that the sparse skeleton information alone is not sufficient to fully characterize human motion. This limitation makes several existing methods incapable of correctly classifying action categories which exhibit only subtle motion differences. In this paper, we propose a novel framework for employing human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network, namely, JOLO-GCN. Specifically, we use Joint-aligned optical Flow Patches (JFP) to capture the local subtle motion around each joint as the pivotal joint-centered visual information. Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance, while keeping the computational and memory overheads low. Experiments on the NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton datasets demonstrate clear accuracy improvements attained by the proposed method over the state-of-the-art skeleton-based methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of skeleton-based human action recognition: sparse skeleton information alone is not sufficient to fully describe human motion, so some existing methods cannot correctly classify action categories that differ only in subtle motion. Specifically, the paper points out:

1. **Local subtle motion is hard to capture**: For action categories defined mainly by local subtle motion (such as "shaking one's head"), the skeletons extracted from two consecutive frames differ very little, so that difference is almost useless for describing the action. In addition, when the body movement of an action is weak, the local subtle motion is easily masked by noisy pose estimation. For actions such as "reading" and "writing", skeleton-based single-modality methods appear to have reached a performance bottleneck.
2. **The skeleton representation is not distinctive enough**: Some action categories cannot be told apart well from the skeleton alone. For example, the skeleton sequences of "pointing at something" and "taking a selfie" are very similar, so skeleton-based single-modality methods are prone to confusing them.

To overcome these limitations, the paper proposes JOLO-GCN, a two-stream graph convolutional network that uses the human pose skeleton jointly with joint-centered light-weight visual information, namely Joint-aligned optical Flow Patches (JFP). JFP capture the local subtle motion around each joint and serve as the key joint-centered visual cue. Compared with a pure skeleton-based baseline, this hybrid scheme improves performance while keeping the computational and memory overheads low. Experiments on NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton show clear accuracy improvements over state-of-the-art skeleton-based methods.
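
To make the idea of Joint-aligned optical Flow Patches more concrete, here is a minimal sketch of cropping a small flow window around each joint. It is an illustration under stated assumptions, not the authors' implementation: the function name `crop_joint_flow_patches`, the default patch size of 8, the zero padding at image borders, and the `(x, y)` pixel convention for joints are all hypothetical choices, and the optical flow field is assumed to have been computed beforehand by any estimator.

```python
import numpy as np


def crop_joint_flow_patches(flow, joints, patch_size=8):
    """Crop a square optical-flow patch centred on each joint (illustrative only).

    flow:   (H, W, 2) array of per-pixel flow vectors (dx, dy), precomputed.
    joints: (J, 2) array-like of joint positions as (x, y) pixel coordinates.
    Returns an array of shape (J, patch_size, patch_size, 2).
    """
    half = patch_size // 2
    h, w = flow.shape[:2]
    # Assumption: zero-pad the flow field so joints near the border
    # still yield full-size patches.
    padded = np.pad(flow, ((half, half), (half, half), (0, 0)), mode="constant")
    patches = np.zeros((len(joints), patch_size, patch_size, 2), dtype=flow.dtype)
    for j, (x, y) in enumerate(joints):
        # Round to the nearest pixel and clamp inside the image.
        cx = int(np.clip(round(float(x)), 0, w - 1))
        cy = int(np.clip(round(float(y)), 0, h - 1))
        # In padded coordinates the joint sits at (cy + half, cx + half),
        # so a window starting at (cy, cx) puts the joint at index `half`.
        patches[j] = padded[cy:cy + patch_size, cx:cx + patch_size]
    return patches


if __name__ == "__main__":
    flow = np.zeros((10, 10, 2), dtype=np.float32)
    flow[5, 5] = (1.0, 2.0)          # a single moving pixel at x=5, y=5
    joints = [(5, 5), (0, 0)]        # one joint on the motion, one at a corner
    patches = crop_joint_flow_patches(flow, joints, patch_size=4)
    print(patches.shape)
```

Stacking such patches over the joints and frames of a sequence gives a per-joint motion feature that a second graph-convolutional stream could consume alongside the skeleton stream. How the two streams are combined is left open here.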