Abstract: Video action recognition (VAR) plays a crucial role in domains such as surveillance, healthcare, and industrial automation, which makes it highly significant for society. Consequently, it has long been a research hotspot in computer vision. As artificial neural networks (ANNs) have flourished, convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually incur a huge computational cost because of the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet current SNNs for VAR also have limitations, such as nontrivial input preprocessing, intricate network construction and training, and the need to process the same video clip repeatedly, all of which hinder practical deployment. In this study, we propose the directly trained SVFormer (Spiking Video transFormer) for VAR. SVFormer integrates local feature extraction, global self-attention, and the intrinsic dynamics, sparsity, and spike-driven nature of SNNs to extract spatio-temporal features efficiently and effectively. We evaluate SVFormer on two RGB datasets (UCF101, NTU-RGBD60) and one neuromorphic dataset (DVS128-Gesture), demonstrating performance comparable to mainstream models at lower cost. Notably, SVFormer achieves a top-1 accuracy of 84.03% with ultra-low energy consumption (21 mJ/video) on UCF101, which is state-of-the-art among directly trained deep SNNs and a significant advantage over prior models.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in video action recognition (VAR):

1. **High computational cost**: Existing artificial neural networks (ANNs), such as convolutional neural networks (CNNs) and vision transformers (ViTs), usually require huge computational resources when processing video data due to the large amount of data, high information redundancy, and the complexity introduced by the time dimension.
2. **Limitations of existing SNNs**: Although some research has turned to brain-inspired spiking neural networks (SNNs) to take advantage of their inherent temporal dynamic characteristics and energy efficiency, the current SNNs used for VAR still have problems such as complex input pre-processing, difficult network construction/training, and the need to repeatedly process the same video segment, which limit their practical applications.

To solve these problems, the authors propose a directly trained spiking transformer model, **SVFormer (Spiking Video transFormer)**. The main innovations and advantages of SVFormer are as follows:

- **Efficient temporal feature extraction**: By combining local feature extraction, global self-attention mechanism, and the inherent dynamic characteristics, sparsity, and spike-driven characteristics of SNNs, SVFormer can efficiently extract spatio-temporal features.
- **Simplified input processing and end-to-end training**: SVFormer can directly process video segments frame by frame without complex input pre-processing, and can be trained end-to-end through the surrogate gradient method, supporting incremental learning and facilitating practical deployment.
- **Low energy consumption**: Experimental results show that on the UCF101 dataset, SVFormer achieves a top-1 accuracy of 84.03% while consuming only 21 mJ/video, demonstrating a significantly better energy-efficiency ratio than previous models.

In conclusion, this paper aims to develop a video action recognition model that is both efficient and energy-saving, especially suitable for resource-constrained scenarios.
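
To make the ideas of frame-by-frame spiking processing and surrogate-gradient training more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The leaky integrate-and-fire (LIF) neuron, the sigmoid-shaped surrogate, and every parameter value (`tau`, `v_th`, `v_reset`, `alpha`) are illustrative assumptions; the summary above does not specify which neuron model or surrogate function SVFormer uses.

```python
import math

def lif_step(v, x, tau=2.0, v_th=1.0, v_reset=0.0):
    """One discrete LIF update (assumed neuron model): leak, integrate, fire, hard reset."""
    v = v + (x - (v - v_reset)) / tau      # leaky integration of the input current
    spike = 1.0 if v >= v_th else 0.0      # binary spike from a Heaviside threshold
    if spike:
        v = v_reset                        # hard reset after firing
    return v, spike

def surrogate_grad(v, v_th=1.0, alpha=4.0):
    """Smooth stand-in for the threshold's derivative: d/dv sigmoid(alpha * (v - v_th)).

    The Heaviside step has zero gradient almost everywhere, so direct training
    swaps in a smooth derivative like this one during the backward pass.
    """
    s = 1.0 / (1.0 + math.exp(-alpha * (v - v_th)))
    return alpha * s * (1.0 - s)

def run_frames(frames, tau=2.0, v_th=1.0):
    """Feed per-frame input currents one frame at a time.

    The membrane potentials persist across frames, so temporal context is
    carried by the neuron state rather than by re-processing the clip.
    """
    v = [0.0] * len(frames[0])
    spikes = []
    for frame in frames:
        out = []
        for i, x in enumerate(frame):
            v[i], s = lif_step(v[i], x, tau, v_th)
            out.append(s)
        spikes.append(out)
    return spikes

if __name__ == "__main__":
    # Two hypothetical neurons driven by constant currents over four frames.
    print(run_frames([[0.5, 1.5]] * 4))
```

In this toy setup a weak input never reaches the threshold and produces no spikes at all, which illustrates the sparsity the summary refers to: downstream layers only have work to do when a spike actually occurs.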

SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition
