Efficient Spatio-Temporal Contrastive Learning for Skeleton-Based 3D Action Recognition

Xuehao Gao, Yang Yang, Yimeng Zhang, Maosen Li, Jin-Gang Yu, Shaoyi Du
DOI: https://doi.org/10.1109/tmm.2021.3127040
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: In this paper, we propose a simple yet effective self-supervised method called spatio-temporal contrastive learning (ST-CL) for 3D skeleton-based action recognition. ST-CL acquires action-specific features by regarding the spatio-temporal continuity of motion tendency as the supervisory signal. To yield effective representations, ST-CL first designs some novel contrastive proxy tasks by providing different spatio-temporal observation scenes for the same 3D action and pulling them together in the embedding space. Second, three key components are devised in the action encoding to efficiently extract representations in contrastive tasks: (1) Information Representation introduces the awareness of joint type when analyzing motion dynamics. (2) Non-local GCN learns a data-driven graph topology structure and promotes spatial message passing among long-range joints in each frame. (3) Multi-Scale TCN provides larger receptive fields for capturing richer long-range temporal dynamics among adjacent frames. In ST-CL, these effective proxy tasks yield useful representations, and efficient action encoding further enhances the representation capacity. As validated on four large-scale datasets, ST-CL is a strong baseline with high performance and efficiency for the contrastive learning study of skeleton data. Compared to previous self-supervised methods, the proposed ST-CL consistently achieves significant improvement with a smaller model size and better training efficiency.
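The idea of producing two observations of the same action and pulling them together in the embedding space can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not state which augmentations or which loss ST-CL uses, so the helpers `spatial_view` and `temporal_view` are hypothetical, and the InfoNCE objective is a commonly used contrastive loss substituted in, not necessarily the paper's own.

```python
import numpy as np


def spatial_view(seq, angle):
    """Hypothetical spatial view: rotate a (T, J, 3) skeleton sequence
    about the vertical (y) axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return seq @ rot.T


def temporal_view(seq, stride):
    """Hypothetical temporal view: keep every `stride`-th frame."""
    return seq[::stride]


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss between two (N, D) batches of embeddings.

    Row i of z_a and row i of z_b are assumed to embed two views of the
    same action (a positive pair); every other row of z_b is a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, an encoder (not shown) would map each view to an embedding, and minimising `info_nce_loss` pulls matching views together while pushing apart views of different actions in the batch.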
computer science, information systems, telecommunications, software engineering
What problem does this paper attempt to address?