LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition

Soroush Oraki, Harry Zhuang, Jie Liang
2024-07-20
Abstract: The complexity of state-of-the-art Transformer-based models for skeleton-based action recognition poses significant challenges in terms of computational efficiency and resource utilization. In this paper, we explore the application of Singular Value Decomposition (SVD) to effectively reduce the model sizes of these pre-trained models, aiming to minimize their resource consumption while preserving accuracy. Our method, LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition), also includes a fine-tuning step to compensate for any potential accuracy degradation caused by model compression, and is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer". Experimental results on the "NTU RGB+D" and "NTU RGB+D 120" datasets show that our method can reduce the number of model parameters substantially with negligible degradation or even performance increase in recognition accuracy. This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **in skeleton-based action recognition, state-of-the-art Transformer models pose significant challenges in computational efficiency and resource utilization**. Although these models perform well, they have a large number of parameters, which leads to high computational cost and resource consumption and makes them difficult to deploy in resource-constrained environments.

### Specific description of the problem

1. **High computational complexity**: Existing Transformer models (such as Hyperformer and STEP-CATFormer) perform very well on skeleton-based action recognition tasks, but their complex architectures and large parameter counts make them computationally expensive and hard to run efficiently in practical applications.
2. **Large resource consumption**: These models require substantial computational resources (such as GPU memory), which limits their use on edge devices and in other resource-constrained environments.
3. **Balance between accuracy and efficiency**: The open question is how to maintain, or even improve, recognition accuracy while reducing the number of parameters and the computational complexity.

### Solution

To address these problems, the paper proposes the **LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition)** method, which has two main components:

1. **Low-rank approximation**: Singular value decomposition (SVD) is used to form a low-rank approximation of the weight matrices in the Transformer model, which significantly reduces the number of model parameters. The approximation is

   \[ W_{\text{LR}} = U_k \Sigma_k V_k^T \]

   where \( W_{\text{LR}} \) is the low-rank approximation of the original weight matrix \( W \), \( \Sigma_k \) is the diagonal matrix holding the \( k \) largest singular values of \( W \), and \( U_k \) and \( V_k \) are the matrices of the corresponding left and right singular vectors.
2. **Fine-tuning**: To compensate for the potential performance degradation caused by the low-rank approximation, the paper fine-tunes the compressed model. The learning rate and other hyperparameters are adjusted so that accuracy is largely preserved after compression and, in some cases, improved.

### Experimental results

The experimental results show that LORTSAR can maintain or even improve the accuracy of skeleton-based action recognition while significantly reducing the number of model parameters. For example, experiments on the NTU RGB+D and NTU RGB+D 120 datasets show that the number of model parameters is reduced by 97.6%, the computational complexity is reduced from 18.35 GFLOPs to 5.30 GFLOPs, and the Top-1 accuracy in some settings even exceeds that of the original model.

In conclusion, this paper aims to improve the computational efficiency and resource utilization of Transformer-based skeleton action recognition models through low-rank approximation and fine-tuning, making them more suitable for practical application scenarios.
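
The low-rank approximation step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the matrix shape (256 x 512), the rank `k = 32`, and the helper names `low_rank_factors` and `param_counts` are assumptions chosen for illustration, and the random matrix stands in for a pre-trained weight. The sketch truncates the SVD of a matrix `W` to rank `k` and keeps the result as two thin factors, which is where the parameter saving comes from: an m x n matrix stores m·n values, while the two factors store k·(m + n).

```python
import numpy as np


def low_rank_factors(W, k):
    """Return (A, B) such that A @ B is the rank-k truncated SVD of W.

    A has shape (m, k) and B has shape (k, n). Folding the singular
    values into A is one possible convention (an assumption here).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]  # scale each kept left singular vector by its singular value
    B = Vt[:k, :]         # kept right singular vectors, already transposed
    return A, B


def param_counts(m, n, k):
    """Parameters stored by the dense matrix vs. by the two factors."""
    return m * n, k * (m + n)


# Hypothetical stand-in for a pre-trained weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
k = 32

A, B = low_rank_factors(W, k)
W_lr = A @ B  # plays the role of W_LR = U_k Sigma_k V_k^T

dense, factored = param_counts(W.shape[0], W.shape[1], k)
rel_err = np.linalg.norm(W - W_lr) / np.linalg.norm(W)

print(f"dense parameters:    {dense}")
print(f"factored parameters: {factored}")
print(f"relative Frobenius error: {rel_err:.3f}")
```

A random Gaussian matrix has a nearly flat singular spectrum, so the reconstruction error in this toy example is large. Trained weight matrices are often closer to low rank, and the fine-tuning step described above is intended to recover whatever accuracy the truncation costs. How the rank is chosen per layer is not stated in this summary, so `k` is left as a free parameter.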