SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition

Liutao Yu, Liwei Huang, Chenlin Zhou, Han Zhang, Zhengyu Ma, Huihui Zhou, Yonghong Tian
2024-06-21
Abstract: Video action recognition (VAR) plays a crucial role in various domains such as surveillance, healthcare, and industrial automation, making it highly significant for society. Consequently, it has long been a research hotspot in the computer vision field. As artificial neural networks (ANNs) are flourishing, convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually demand huge computational cost due to the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet, current SNNs for VAR also encounter limitations, such as nontrivial input preprocessing, intricate network construction/training, and the need for repetitive processing of the same video clip, hindering their practical deployment. In this study, we propose the directly trained SVFormer (Spiking Video transFormer) for VAR. SVFormer integrates local feature extraction, global self-attention, and the intrinsic dynamics, sparsity, and spike-driven nature of SNNs, to efficiently and effectively extract spatio-temporal features. We evaluate SVFormer on two RGB datasets (UCF101, NTU-RGBD60) and one neuromorphic dataset (DVS128-Gesture), demonstrating comparable performance to the mainstream models in a more efficient way. Notably, SVFormer achieves a top-1 accuracy of 84.03% with ultra-low power consumption (21 mJ/video) on UCF101, which is state-of-the-art among directly trained deep SNNs, showcasing significant advantages over prior models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key problems in video action recognition (VAR):

1. **High computational cost**: Existing artificial neural networks (ANNs), such as convolutional neural networks (CNNs) and vision transformers (ViTs), usually require huge computational resources for video because of the large data volume and the heavy information redundancy introduced by the temporal dimension.
2. **Limitations of existing SNNs**: Some research has turned to brain-inspired spiking neural networks (SNNs) for their inherent temporal dynamics and energy efficiency. However, current SNNs for VAR still suffer from complex input preprocessing, difficult network construction and training, and the need to process the same video clip repeatedly, all of which limit practical deployment.

To solve these problems, the authors propose a directly trained spiking transformer, **SVFormer (Spiking Video transFormer)**. Its main innovations and advantages are:

- **Efficient spatio-temporal feature extraction**: SVFormer combines local feature extraction, global self-attention, and the inherent dynamics, sparsity, and spike-driven nature of SNNs to extract spatio-temporal features efficiently.
- **Simplified input processing and end-to-end training**: SVFormer processes video clips frame by frame without complex input preprocessing, and it is trained end to end with the surrogate gradient method, supporting incremental learning and facilitating practical deployment.
- **Low energy consumption**: On the UCF101 dataset, SVFormer achieves a top-1 accuracy of 84.03% while consuming only 21 mJ/video, a significantly better accuracy-to-energy trade-off than previous models.
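To make the "surrogate gradient" and "frame by frame" ideas above concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron in plain Python. It is not SVFormer's implementation: the neuron model, the sigmoid-shaped surrogate, and all names and parameter values (`lif_forward`, `tau`, `v_threshold`, `alpha`) are assumptions chosen for illustration. The forward pass emits binary spikes through a hard threshold; because that threshold has no useful derivative, training substitutes a smooth surrogate derivative in the backward pass.

```python
import math


def heaviside(x):
    """Forward spike function: emit 1.0 when x >= 0, else 0.0."""
    return 1.0 if x >= 0.0 else 0.0


def sigmoid_surrogate_grad(x, alpha=4.0):
    """Derivative of sigmoid(alpha * x), used in place of the step's gradient."""
    s = 1.0 / (1.0 + math.exp(-alpha * x))
    return alpha * s * (1.0 - s)


def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Run one LIF neuron over a sequence of input currents, one per frame.

    Returns the spike train and the surrogate gradient at each time step.
    All parameter values are illustrative assumptions.
    """
    v = v_reset
    spikes, surrogate_grads = [], []
    for x in inputs:
        # Leaky integration: the membrane potential relaxes toward the input.
        v = v + (x - (v - v_reset)) / tau
        spike = heaviside(v - v_threshold)
        spikes.append(spike)
        # Value a backward pass would use instead of d(spike)/d(v).
        surrogate_grads.append(sigmoid_surrogate_grad(v - v_threshold))
        if spike:
            # Hard reset after firing.
            v = v_reset
    return spikes, surrogate_grads


if __name__ == "__main__":
    frame_currents = [0.5, 0.5, 3.0, 0.0]  # hypothetical per-frame inputs
    spikes, grads = lif_forward(frame_currents)
    print(spikes)
    print(grads)
```

The membrane potential `v` carries state from one frame to the next, which is how a spiking network can consume a clip sequentially rather than presenting the same input repeatedly. The sparsity of the resulting spike train is what the efficiency claims rest on.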
In conclusion, this paper aims to develop a video action recognition model that is both efficient and energy-saving, especially suitable for resource-constrained scenarios.
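For context on where a figure like "mJ/video" can come from, the sketch below shows an estimate widely used in SNN papers: a dense ANN layer is charged one multiply-accumulate (MAC) per operation, while a spiking layer is charged one accumulate (AC) per synaptic operation, with synaptic operations approximated as firing rate × time steps × the layer's operation count. Whether the 21 mJ/video result was computed exactly this way is not stated in the text above. The per-operation energies (4.6 pJ per MAC, 0.9 pJ per AC, commonly cited 45 nm estimates) and the layer numbers in the usage example are assumptions, not values from the paper.

```python
E_MAC_PJ = 4.6  # assumed energy per multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy per accumulate, in picojoules
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_energy_mj(flops):
    """Estimated energy of a dense layer: every operation is a MAC."""
    return flops * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(flops, firing_rate, timesteps):
    """Estimated energy of a spiking layer: only spikes trigger accumulates."""
    synaptic_ops = firing_rate * timesteps * flops
    return synaptic_ops * E_AC_PJ * PJ_TO_MJ


if __name__ == "__main__":
    # Hypothetical layer: 1e9 operations, 20% average firing rate, 4 time steps.
    print(ann_energy_mj(1e9))
    print(snn_energy_mj(1e9, firing_rate=0.2, timesteps=4))
```

Under this model the saving comes from two sources: accumulates are cheaper than multiply-accumulates, and a low firing rate means most potential operations never happen.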