Abstract: The complexity of state-of-the-art Transformer-based models for skeleton-based action recognition poses significant challenges in terms of computational efficiency and resource utilization. In this paper, we explore the application of Singular Value Decomposition (SVD) to effectively reduce the model sizes of these pre-trained models, aiming to minimize their resource consumption while preserving accuracy. Our method, LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition), also includes a fine-tuning step to compensate for any potential accuracy degradation caused by model compression, and is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer". Experimental results on the "NTU RGB+D" and "NTU RGB+D 120" datasets show that our method can reduce the number of model parameters substantially with negligible degradation, or even an increase, in recognition accuracy. This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.

What problem does this paper attempt to address?

The problem this paper addresses is: **in skeleton-based action recognition, state-of-the-art Transformer models pose significant challenges in computational efficiency and resource utilization**. Although these models perform strongly, they have a large number of parameters, which leads to high computational cost and resource consumption and makes them difficult to deploy in resource-constrained environments.

### Specific description of the problem

1. **High computational complexity**: Existing Transformer models (such as Hyperformer and STEP-CATFormer) perform excellently in skeleton-based action recognition tasks, but their complex architectures and large parameter counts make them computationally expensive and hard to run efficiently in practical applications.
2. **Large resource consumption**: These models require substantial computational resources (such as GPU memory), which limits their use on edge devices and in other resource-constrained environments.
3. **Balance between accuracy and efficiency**: The open question is how to maintain, or even improve, recognition accuracy while reducing the number of parameters and the computational complexity.

### Solution

To address these problems, the paper proposes the **LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition)** method, which has two main components:

1. **Low-Rank Approximation**: Singular value decomposition (SVD) is used to compute a low-rank approximation of the weight matrices in the Transformer model, which significantly reduces the number of model parameters. The approximation is:

   \[ W_{\text{LR}} = U_k \Sigma_k V_k^T \]

   where \( W_{\text{LR}} \) is the low-rank approximation of the original weight matrix \( W \), \( \Sigma_k \) is the diagonal matrix of the \( k \) largest singular values of \( W \), and \( U_k \) and \( V_k \) are the matrices of the corresponding left and right singular vectors.
2. **Fine-tuning**: To compensate for the potential performance degradation caused by the low-rank approximation, the compressed model is fine-tuned. The learning rate and other hyperparameters are adjusted so that accuracy is preserved, and in some cases improved, despite the reduced parameter count.

### Experimental results

The experimental results show that LORTSAR can maintain or even improve the accuracy of skeleton-based action recognition while significantly reducing the number of model parameters. For example, experiments on the NTU RGB+D and NTU RGB+D 120 datasets show that the number of model parameters is reduced by 97.6%, the computational complexity is reduced from 18.35 GFLOPs to 5.30 GFLOPs, and the Top-1 accuracy in some settings even exceeds that of the original model.

In conclusion, this paper aims to improve the computational efficiency and resource utilization of Transformer-based skeleton action recognition models through low-rank approximation and fine-tuning, making them more suitable for practical application scenarios.
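The low-rank approximation step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the matrix sizes, the rank `k`, the synthetic weight matrix, and the helper names `low_rank_factors` and `param_counts` are all assumptions chosen for illustration, and the fine-tuning step is not shown. The sketch stores the truncated SVD as two thin factors, which is where the parameter saving comes from: a dense \( m \times n \) matrix holds \( mn \) values, while the two factors hold \( k(m + n) \).

```python
import numpy as np


def low_rank_factors(W, k):
    """Return (A, B) such that A @ B is the rank-k truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]  # U_k with each column scaled by its singular value, shape (m, k)
    B = Vt[:k, :]         # V_k^T, shape (k, n)
    return A, B


def param_counts(m, n, k):
    """Parameter count of a dense m x n matrix versus its two rank-k factors."""
    return m * n, k * (m + n)


# Hypothetical weight matrix: approximately rank-k plus small noise (illustrative only).
rng = np.random.default_rng(0)
m, n, k = 256, 256, 16
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W += 0.01 * rng.standard_normal((m, n))

A, B = low_rank_factors(W, k)
W_lr = A @ B  # corresponds to W_LR = U_k Sigma_k V_k^T

rel_err = np.linalg.norm(W - W_lr) / np.linalg.norm(W)
dense, factored = param_counts(m, n, k)
print(f"relative Frobenius error: {rel_err:.4f}")
print(f"parameters: dense={dense}, factored={factored}")
```

In a network, a linear layer with weight \( W \) would be replaced by two smaller layers applying `B` and then `A`. How much accuracy survives depends on how quickly the singular values of the real weights decay, which is why the method follows compression with fine-tuning.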

LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition
