Localized Linear Temporal Dynamics for Self-Supervised Skeleton Action Recognition

Xinghan Wang, Yadong Mu
DOI: https://doi.org/10.1109/tmm.2024.3405712
IF: 7.3
2024-10-19
IEEE Transactions on Multimedia
Abstract: Self-supervised skeleton action recognition has gained notable attention for its reduced reliance on annotated data. Contrastive learning methods, in particular, have emerged as prominent approaches. These works typically use a spatial-temporal backbone to extract features from action sequences and contrast them in the feature space. Yet they often rely on average pooling for temporal feature aggregation, neglecting the intricate higher-order temporal dynamics of the sequences. In this work, we introduce Koopman Temporal Contrastive Learning (KTCL), a contrastive learning framework inspired by Koopman theory, which focuses on the localized latent dynamics of the sequence by learning discriminative linear system dynamics. Given an action sequence, we first map it into a new space where the temporal evolution becomes linear. A dynamics-oriented contrastive loss then pulls the dynamics of positive samples closer together and pushes the dynamics of negative samples further apart. To handle the diverse dynamics across different action phases within one sequence, we further introduce segment-level localized linear dynamics, accompanied by a cross-matching mechanism for alignment. Additionally, a cross-order contrastive loss is proposed to further amplify the effect of contrast across features of different orders. Extensive experiments on four benchmark datasets show that the proposed methods outperform competing methods.
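The idea of linear latent dynamics described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the latent features are already given as an array of shape (T, d), fits the linear operator by ordinary least squares, splits the sequence into equal contiguous segments, and compares operators by cosine similarity. The function names (fit_linear_dynamics, segment_dynamics, dynamics_similarity, simulate) are hypothetical, and the abstract does not specify how KTCL learns the mapping, the operator, or the cross-matching step.

```python
import numpy as np

def fit_linear_dynamics(z):
    """Least-squares operator K with z[t+1] ~= z[t] @ K, for z of shape (T, d)."""
    x, y = z[:-1], z[1:]                      # consecutive latent states
    k, *_ = np.linalg.lstsq(x, y, rcond=None)
    return k                                  # shape (d, d)

def segment_dynamics(z, num_segments):
    """Fit one operator per contiguous temporal segment (assumed equal split)."""
    segments = np.array_split(z, num_segments, axis=0)
    return [fit_linear_dynamics(seg) for seg in segments]

def dynamics_similarity(k_a, k_b):
    """Cosine similarity between two flattened operators (an assumed metric)."""
    a, b = k_a.ravel(), k_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def simulate(k, z0, steps):
    """Roll out z[t+1] = z[t] @ K from an initial state, returning (steps, d)."""
    states = [z0]
    for _ in range(steps - 1):
        states.append(states[-1] @ k)
    return np.stack(states)

# Toy check: a slowly decaying 2-D rotation is recovered from its trajectory.
theta = 0.3
k_true = 0.99 * np.array([[np.cos(theta), np.sin(theta)],
                          [-np.sin(theta), np.cos(theta)]])
z = simulate(k_true, np.array([1.0, 0.0]), steps=20)
k_hat = fit_linear_dynamics(z)
local_ks = segment_dynamics(z, num_segments=2)
```

In a contrastive setting, a similarity of this kind between the operators of two augmented views would play the role that feature similarity plays in a standard contrastive loss; how KTCL actually defines that loss is not stated in the abstract.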
computer science, information systems, telecommunications, software engineering