HL-ESViT: High-Low Frequency Efficient Spiking Vision Transformer

Kexin Shi, Hanwen Liu, Yi Chen, Hong Qu
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650846
2024-01-01
Abstract: Brain-inspired Spiking Neural Networks (SNNs) offer a promising event-driven, low-power approach to deep learning. The self-attention (SA) mechanism, the cornerstone of the high-performance transformer architecture, enables a model to capture the relationships between different regions of an image. However, the quadratic complexity of self-attention over long representation sequences hinders the wide application of transformers. In this work, we introduce a novel High-Low Frequency Multi-scale Multi-head Self-Attention mechanism (HL-MMSA) and an efficient vision transformer model named HL-ESViT. In HL-MMSA, the input feature maps are processed through high-frequency and low-frequency pathways, and HL-ESViT departs from repeatedly stacking transformer blocks, reducing memory and computational costs. To better capture the spatial features of images, we incorporate a novel positional encoding scheme, Relative Position Embedding MultiLayer Perceptron (RPEMP). The proposed HL-ESViT achieves a tradeoff between performance and efficiency. Extensive experiments demonstrate our model’s competitive performance on the static datasets CIFAR10 and CIFAR100 and the neuromorphic datasets DVS128 Gesture and CIFAR10-DVS.
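The quadratic cost mentioned in the abstract can be illustrated with a rough count of attention scores. This is only a sketch: the image size and patch sizes are illustrative assumptions, not values from the paper, and the coarser tokenization shown here is a generic way to shorten the sequence, not the HL-MMSA mechanism itself.

```python
def tokens_for_image(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image (hypothetical helper)."""
    return (height // patch) * (width // patch)


def attention_score_entries(num_tokens: int) -> int:
    """Pairwise scores in one head's attention matrix: one per (query, key) pair."""
    return num_tokens * num_tokens


# Assumed 32x32 input (the CIFAR image size) with two illustrative patch sizes.
fine_tokens = tokens_for_image(32, 32, patch=4)    # finer tokenization
coarse_tokens = tokens_for_image(32, 32, patch=8)  # coarser tokenization

fine_cost = attention_score_entries(fine_tokens)
coarse_cost = attention_score_entries(coarse_tokens)

print(fine_tokens, fine_cost)
print(coarse_tokens, coarse_cost)
```

Reducing the token count by a factor of four reduces the number of attention scores by a factor of sixteen, which is why attending over a shorter or coarser sequence is a common route to cheaper self-attention.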