Abstract: Action recognition from 3D skeleton data is an important emerging topic. Most existing methods either rely on hand-crafted descriptors to recognize actions or perform supervised action representation learning with massive labels. In this paper, we propose, for the first time, a contrastive action learning paradigm named AS-CAL that exploits different augmentations of unlabeled skeleton sequences to learn action representations in an unsupervised manner. Specifically, we first propose to contrast the similarity between augmented instances of the input skeleton sequence, which are transformed with multiple novel augmentation strategies, to learn inherent action patterns ("pattern-invariance") across different skeleton transformations. Second, to encourage learning the pattern-invariance with more consistent action representations, we propose a momentum LSTM, implemented as the momentum-based moving average of the LSTM-based query encoder, to encode long-term action dynamics of the key sequence. Third, we introduce a queue to store the encoded keys, which allows flexible reuse of preceding keys to build a consistent dictionary that facilitates contrastive learning. Last, we propose a novel representation named Contrastive Action Encoding (CAE) to represent human actions effectively. Empirical evaluations show that our approach significantly outperforms hand-crafted methods by 10-50% in top-1 accuracy, and that it can even achieve superior performance to many supervised learning methods.
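
The three mechanisms named in the abstract (a momentum-averaged key encoder, a queue of encoded keys, and a contrastive objective) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the loss, so an InfoNCE-style loss is assumed here, the LSTM encoders are replaced by plain lists of numbers, and the names `momentum_update`, `info_nce`, the momentum `m`, the `temperature`, and the queue size are all hypothetical choices for illustration.

```python
import math
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """Move each key-encoder parameter toward the query encoder's value
    as a moving average (m is an assumed momentum coefficient)."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Assumed InfoNCE-style loss: negative log-softmax of the positive
    key's similarity among the positive and all negative keys."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Fixed-size queue of previously encoded keys; the oldest key is dropped
# automatically once the (assumed) capacity of 4 is reached.
key_queue = deque(maxlen=4)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    key_queue.append(key)

# Toy embeddings standing in for two augmentations of one sequence.
query = [1.0, 0.1]
positive = [1.0, 0.0]

# The loss is small when the query matches its positive key and large
# when the "positive" points away from the query.
loss_aligned = info_nce(query, positive, list(key_queue))
loss_misaligned = info_nce(query, [-1.0, 0.0], list(key_queue))

# After the step, the current key joins the queue for reuse as a negative,
# and the key encoder drifts slowly toward the query encoder.
key_queue.append(positive)
new_key_params = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

In this sketch the keys in the queue act as negatives for later queries, and the slow momentum update keeps those stored keys comparable with newly encoded ones, which is the consistency argument the abstract makes for the dictionary.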

Spatial-Temporal Data Augmentation Based on LSTM Autoencoder Network for Skeleton-Based Human Action Recognition

Explorations of Skeleton Features for LSTM-based Action Recognition

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks

An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition

Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition

Skeleton-based Action Recognition Using LSTM and CNN

Learning Explicit Shape And Motion Evolution Maps For Skeleton-Based Human Action Recognition

Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks

Sample Fusion Network: an End-to-End Data Augmentation Network for Skeleton-Based Human Action Recognition

Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks

Enhancing Human Action Recognition with 3D Skeleton Data: A Comprehensive Study of Deep Learning and Data Augmentation

Spatial-Temporal Adaptive Metric Learning Network for One-Shot Skeleton-Based Action Recognition

Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Spatial Temporal Transformer Network for Skeleton-based Action Recognition

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data

Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning

Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network