Global–local Contrastive Multiview Representation Learning for Skeleton-Based Action Recognition

Cunling Bian, Wei Feng, Fanbo Meng, Song Wang
DOI: https://doi.org/10.1016/j.cviu.2023.103655
IF: 4.886
2023-01-01
Computer Vision and Image Understanding
Abstract: Skeleton-based human action recognition has drawn growing interest because it is largely insensitive to appearance changes and skeleton data have become more accessible. However, skeletons captured in practice are sensitive to the view of the actor, owing to occlusion of different body joints and errors in joint localization. Each view is noisy and incomplete, but important factors, such as motion and semantics, should be shared across all views in action representation learning. We support the classic hypothesis that a powerful representation is one that models view-invariant factors, a hypothesis that extends to unsupervised learning. We therefore study this hypothesis under the framework of contrastive multiview learning, learning a representation for action recognition that maximizes the mutual information between different views of the same action sequence. In addition, a global–local contrastive loss is proposed to model multi-scale co-occurrence relationships in both the spatial and temporal domains. Extensive experimental results show that the proposed method significantly boosts the performance of unsupervised skeleton-based human action recognition methods on three challenging benchmarks: PKUMMD, NTU RGB+D 60, and NTU RGB+D 120.
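To make the idea of maximizing mutual information between views concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive objective, which is a commonly used lower-bound surrogate for mutual information. It is not the paper's global–local loss, whose exact form the abstract does not give. The function name `info_nce`, the temperature value, and the embedding shapes are all assumptions chosen for illustration.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss between two batches of view embeddings.

    z_a, z_b: arrays of shape (N, D); row i of each array is assumed to
    come from the same action sequence seen from two different views.
    (Hypothetical helper for illustration, not the authors' code.)
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarities between every view-A and view-B embedding.
    logits = z_a @ z_b.T / temperature  # shape (N, N)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Matching pairs lie on the diagonal; all other entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(8, 16))
    # A second "view": the same embeddings with a small perturbation.
    view_b = view_a + 0.01 * rng.normal(size=view_a.shape)
    print("aligned views :", info_nce(view_a, view_b))
    print("shuffled views:", info_nce(view_a, view_b[::-1].copy()))
```

In this sketch, correctly paired views yield a lower loss than mismatched ones, which is the behaviour a contrastive multiview objective rewards. A global–local variant would presumably apply such a term at several spatial and temporal scales, but that detail is an assumption here.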