ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau
DOI: https://doi.org/10.48550/arXiv.2409.05749
2024-09-10
Abstract: To extract robust and generalizable skeleton action recognition features, large amounts of well-curated data are typically required, which is a challenging task hindered by annotation and computation costs. Therefore, unsupervised representation learning is of prime importance to leverage unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to ensure more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets: MCAD, IXMAS, JHMDB, and NW-UCLA, showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code including all implementation parameters is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to use unlabeled skeleton data for effective unsupervised representation learning in skeleton action recognition, thereby reducing the dependence on large-scale labeled data. The authors propose ReL-SAR, a lightweight convolutional transformer framework that combines convolutional layers and attention mechanisms to model spatial and temporal cues jointly. They also introduce a Selection-Permutation strategy for skeleton joints so that the skeletal data yields more informative descriptions, and they use Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data.

### Problem Background

Human Activity Recognition (HAR) is widely used in fields such as video surveillance, sports analysis, robotics, and health monitoring. Accurately recognizing human activities remains challenging in real-world conditions because of varying viewpoints, differences in how individuals perform the same action, and unfavorable recording conditions. Traditional fully supervised methods require large labeled datasets, which are time-consuming and costly to build. Self-supervised methods that learn features from unlabeled data have therefore become an important research direction.

### Solution

The authors propose three components:

1. **Lightweight Convolutional-Transformer Model (ReL-SAR)**: An efficient skeleton action recognition framework that combines the spatial feature extraction ability of convolutional layers with the temporal dependency modeling ability of transformers.
2. **Selection-Permutation Strategy**: Skeleton joints are selected and reordered, which removes irrelevant joints and makes the joint order in the input sequence more meaningful, improving model performance.
3. **Bootstrap Your Own Latent (BYOL)**: BYOL is used to learn robust low-level representations from a large amount of unlabeled skeleton sequence data; these representations can then be fine-tuned for downstream tasks.

### Experimental Results

The authors conducted experiments on several publicly available small-scale datasets (MCAD, IXMAS, JHMDB, and NW-UCLA) to verify the effectiveness and computational efficiency of the proposed method. ReL-SAR achieved very competitive results on these datasets, and its main advantage over existing state-of-the-art methods is its lower computational resource consumption.

### Formula Summary

Key formulas in the paper include:

- **Multi-Head Self-Attention (MSA)**:

  \[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

  where \( X \) is the input sequence, \( W_Q, W_K, W_V \in \mathbb{R}^{D_{\text{model}} \times D_h} \), and \( D_h \) is the dimension of the attention head.

- **BYOL Loss Function**:

  \[ L_{\theta, \xi} = \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\| q_\theta(z_\theta) \|_2 \cdot \| z'_\xi \|_2} \]

  where \( \bar{q}_\theta(z_\theta) \) and \( \bar{z}'_\xi \) are the \( \ell_2 \)-normalized online prediction and target projection. The second equality holds only because both vectors are normalized to unit length. The final loss is symmetrized by adding the term \( L'_{\theta, \xi} \), obtained by swapping the two augmented views:

  \[ L_{\text{BYOL}}^{\theta, \xi} = L_{\theta, \xi} + L'_{\theta, \xi} \]

These formulas describe the attention projections and the self-supervised objective at the core of the model.
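The Q, K, V projections in the Formula Summary can be illustrated with a minimal NumPy sketch of a single attention head. The summary only gives the projections; the scaled dot-product and softmax steps below follow the standard transformer formulation and are an assumption about how the projections are used, not a reproduction of the authors' code. The shapes (`T`, `D_model`, `D_h`) and the random inputs are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Single attention head (standard formulation, assumed here).

    X has shape (T, D_model); each W has shape (D_model, D_h).
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # projections from the summary
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)          # (T, T) scaled similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # output has shape (T, D_h)

# Illustrative sizes only: 6 time steps, model width 16, head width 4.
rng = np.random.default_rng(0)
T, D_model, D_h = 6, 16, 4
X = rng.normal(size=(T, D_model))
W_Q, W_K, W_V = (rng.normal(size=(D_model, D_h)) for _ in range(3))

out, weights = attention_head(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

A multi-head layer would run several such heads in parallel and concatenate their outputs; that step is omitted here to keep the sketch short.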
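The BYOL loss in the Formula Summary can be checked numerically. The sketch below, written with NumPy under the assumption that predictions and target projections are plain vectors, computes the loss both as a squared distance between normalized vectors and as `2 - 2 * cosine similarity`, and then forms the symmetrized sum. The function names and the random inputs are hypothetical and chosen for illustration; they are not taken from the authors' repository.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit length; eps guards against division by zero.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def byol_loss(q, z_target):
    """Squared distance between normalized prediction and target, per sample."""
    diff = l2_normalize(q) - l2_normalize(z_target)
    return np.sum(diff ** 2, axis=-1)

def byol_loss_cosine(q, z_target):
    """Equivalent form: 2 - 2 * cosine similarity, per sample."""
    cos = np.sum(l2_normalize(q) * l2_normalize(z_target), axis=-1)
    return 2.0 - 2.0 * cos

def symmetric_byol_loss(q1, z2_target, q2, z1_target):
    """Sum of the loss and its view-swapped counterpart, averaged over the batch."""
    return np.mean(byol_loss(q1, z2_target) + byol_loss(q2, z1_target))

# Illustrative batch of 8 vectors of width 32 (placeholder sizes).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 32))        # stands in for online predictions
z = rng.normal(size=(8, 32))        # stands in for target projections

# The two forms agree because both vectors are normalized first.
print(np.allclose(byol_loss(q, z), byol_loss_cosine(q, z)))  # True
total = symmetric_byol_loss(q, z, z, q)
```

The per-sample loss ranges from 0 (vectors pointing the same way) to 4 (vectors pointing in opposite directions). In BYOL training the target projection is treated as a constant when computing gradients; this sketch evaluates the loss value only and does not model that.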