Abstract: Extracting robust and generalizable features for skeleton action recognition typically requires large amounts of well-curated data, which are hard to obtain because of annotation and computation costs. Unsupervised representation learning is therefore of prime importance for leveraging unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, that exploits the complementarity of convolutional and attention layers to jointly model spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to obtain more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA), showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code, including all implementation parameters, is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL

What problem does this paper attempt to address?

The paper asks how, in skeleton action recognition, unlabeled skeleton data can be used for effective unsupervised representation learning, reducing the dependence on large-scale labeled data. Specifically, the authors propose a lightweight convolutional-transformer framework, ReL-SAR, which combines convolutional layers and attention mechanisms to model spatial and temporal cues jointly. They also introduce a Selection-Permutation strategy so that the skeleton data yields more informative descriptions, and they use Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequences.

### Problem Background

Human Activity Recognition (HAR) is widely used in video surveillance, sports analysis, robotics, and health monitoring. However, real-world factors such as varying viewpoints, differences in how individuals perform the same action, and unfavorable recording conditions make accurate recognition challenging. Traditional fully supervised methods require large labeled datasets, which are time-consuming and costly to build. Self-supervised learning from unlabeled data has therefore become an important research direction.

### Solution

The authors propose three components:

1. **Lightweight Convolutional-Transformer Model (ReL-SAR)**: Combines the spatial feature extraction of convolutional layers with the temporal-dependency modeling of transformers to form an efficient skeleton action recognition framework.
2. **Selection-Permutation Strategy**: Selects and reorders skeleton joints, removing irrelevant joints and making the joint order in the input sequence more meaningful, which improves model performance.
3. **Bootstrap Your Own Latent (BYOL)**: Learns robust representations from unlabeled skeleton sequence data; these representations can then be fine-tuned for downstream tasks.

### Experimental Results

The authors ran experiments on four publicly available small-scale datasets (MCAD, IXMAS, JHMDB, and NW-UCLA) to verify the effectiveness and computational efficiency of the proposed method. ReL-SAR achieves very competitive results on these datasets, and its clearest advantage over existing state-of-the-art methods is lower computational resource consumption.

### Formula Summary

Key formulas in the paper include:

- **Multi-Head Self-Attention (MSA)**:
  \[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]
  where \( W_Q, W_K, W_V \in \mathbb{R}^{D_{\text{model}} \times D_h} \) are the projection matrices and \( D_h \) is the dimension of each attention head.
- **BYOL Loss Function**:
  \[ L_{\theta, \xi} = \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\| q_\theta(z_\theta) \|_2 \cdot \| z'_\xi \|_2} \]
  where the bar denotes L2 normalization, so the squared distance between the two unit vectors equals 2 minus twice their cosine similarity. The final loss is defined as:
  \[ L_{\text{BYOL}}^{\theta, \xi} = L_{\theta, \xi} + L'_{\theta, \xi} \]
  where \( L'_{\theta, \xi} \) is the same term computed with the two augmented views swapped.

These formulas capture the two core computations of the method: the attention projections and the BYOL training objective.
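The MSA projections above can be illustrated with a minimal NumPy sketch of a single attention head. The summary only states the Q, K, V projections; the scaled dot-product step (softmax of QK^T divided by the square root of D_h) is the standard transformer formulation and is assumed here, not confirmed as the exact variant used in ReL-SAR. The sizes and random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One self-attention head over a sequence X of shape (T, D_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each has shape (T, D_h)
    D_h = Q.shape[-1]
    # Assumed standard scaled dot-product attention.
    weights = softmax(Q @ K.T / np.sqrt(D_h), axis=-1)  # shape (T, T)
    return weights @ V, weights

# Placeholder sizes: 8 time steps, model width 16, head width 4.
rng = np.random.default_rng(0)
T, D_model, D_h = 8, 16, 4
X = rng.normal(size=(T, D_model))
W_Q, W_K, W_V = (rng.normal(size=(D_model, D_h)) for _ in range(3))

out, weights = attention_head(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)
```

A multi-head layer would run several such heads in parallel and concatenate their outputs; that step is omitted to keep the sketch short.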
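The BYOL loss above can be checked numerically with a short NumPy sketch. It computes the squared distance between L2-normalized vectors and compares it with the cosine form. The function names (`l2_normalize`, `byol_term`, `byol_loss`) and the toy vectors are illustrative assumptions; a real setup would obtain the predictions from the online network and the projections from the target network, with gradients stopped on the target side.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit length; eps guards against division by zero.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def byol_term(q, z_target):
    """Squared distance between the normalized prediction and target."""
    diff = l2_normalize(q) - l2_normalize(z_target)
    return np.sum(diff ** 2, axis=-1)

def byol_loss(q_view1, z_view2, q_view2, z_view1):
    """Symmetrized loss: the same term with the two views swapped, summed."""
    return byol_term(q_view1, z_view2) + byol_term(q_view2, z_view1)

# Toy vectors (not real network outputs).
q = np.array([[3.0, 4.0], [1.0, 0.0], [1.0, 0.0]])
z = np.array([[3.0, 4.0], [-1.0, 0.0], [0.0, 1.0]])

term = byol_term(q, z)
cosine = np.sum(l2_normalize(q) * l2_normalize(z), axis=-1)
print(term)            # aligned, opposite, and orthogonal pairs
print(2 - 2 * cosine)  # same values via the cosine form
```

The term is 0 for aligned vectors, 2 for orthogonal ones, and 4 for opposite ones, so minimizing it pulls the online prediction toward the target projection in direction only, regardless of vector length.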
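The Selection-Permutation strategy described in the Solution section amounts to indexing the joint axis with a chosen list of joints in a chosen order. The sketch below shows only that mechanic. The 6-joint skeleton and the particular `joint_order` are hypothetical; the summary does not say which joints the paper keeps or how it orders them.

```python
import numpy as np

def select_and_permute(sequence, joint_order):
    """Keep only the joints listed in `joint_order`, in that order.

    sequence:    array of shape (frames, joints, coords)
    joint_order: joint indices to keep, in the desired output order
    """
    sequence = np.asarray(sequence)
    return sequence[:, list(joint_order), :]

# Hypothetical skeleton: 4 frames, 6 joints, 2D coordinates.
frames, joints, coords = 4, 6, 2
sequence = np.arange(frames * joints * coords, dtype=float)
sequence = sequence.reshape(frames, joints, coords)

# Illustrative choice only: drop joints 1 and 4 and reorder the rest.
joint_order = [3, 0, 5, 2]
reduced = select_and_permute(sequence, joint_order)
print(reduced.shape)  # (4, 4, 2)
```

Because the operation is a fixed index list, it can be applied once as a preprocessing step before the sequence reaches the network.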

ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL
