Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms

Qingyu Wang, Duzhen Zhang, Tielin Zhang, Bo Xu
2023-08-17
Abstract: By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNN design. It introduces a Spiking Self-Attention (SSA) module to mix sparse visual features using spike-form Query, Key, and Value, achieving state-of-the-art (SOTA) performance on numerous datasets compared to previous SNN-like frameworks. In this paper, we demonstrate that the Spikformer architecture can be accelerated by replacing the SSA with an unparameterized Linear Transform (LT) such as the Fourier or Wavelet transform. These transforms are used to mix spike sequences, reducing the quadratic time complexity to log-linear time complexity. They alternate between the frequency and time domains to extract sparse visual features, delivering strong performance and efficiency. We conduct extensive experiments on image classification using both neuromorphic and static datasets. The results indicate that, compared to the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets (i.e., CIFAR10-DVS and DVS128 Gesture) and comparable Top-1 accuracy on static datasets (i.e., CIFAR-10 and CIFAR-100). Furthermore, Spikformer with LT achieves approximately 29-51% improvement in training speed and 61-70% improvement in inference speed, and reduces memory usage by 4-26% because the transforms require no learnable parameters.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main objective of this paper is to explore whether simpler sequence mixing mechanisms (such as the Fourier transform or wavelet transform) can completely replace the relatively complex Spiking Self-Attention (SSA) sublayer in the Spikformer architecture. The study found that even simple linear transformations without learnable parameters (such as the Fourier transform and wavelet transform) can achieve higher Top-1 accuracy than SSA on neuromorphic datasets and exhibit comparable performance on static datasets. Additionally, these simple linear transformations significantly improve computational efficiency, reduce memory usage, and enhance training and inference speeds by approximately 29-51% and 61-70%, respectively.
Specifically, the main contributions of the paper include:
1. Demonstrating that even simple linear transformations like the Fourier transform and wavelet transform can effectively extract sparse visual features, with surprising results indicating that SSA may not be the key factor driving Spikformer's performance.
2. Introducing a new Spikformer variant that utilizes the Fourier transform or wavelet transform for sequence mixing, and providing a comprehensive analysis of the time complexity of different sequence mixing mechanisms.
3. Extensive experiments validating that the proposed Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets compared to the original Spikformer with SSA, and shows comparable performance on static datasets, while also significantly improving computational efficiency and reducing memory usage by 4-26%.
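To make the idea of parameter-free sequence mixing concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes an FNet-style mixer that applies a 2D FFT over the token and feature axes and keeps only the real part, and the names `fourier_mix`, `attention_cost`, and `fft_cost` are hypothetical. The cost helpers are rough asymptotic operation counts with constants ignored, included only to illustrate the quadratic versus log-linear scaling the summary mentions.

```python
import math

import numpy as np


def fourier_mix(spikes: np.ndarray) -> np.ndarray:
    """Mix a (tokens, features) spike matrix with a 2D FFT and keep the real part.

    Assumption: FNet-style mixing; the paper's exact formulation may differ.
    There are no learnable parameters anywhere in this function.
    """
    return np.real(np.fft.fft2(spikes))


def attention_cost(n_tokens: int, n_features: int) -> int:
    """Rough cost of forming an n x n attention map: O(n^2 * d)."""
    return n_tokens * n_tokens * n_features


def fft_cost(n_tokens: int, n_features: int) -> float:
    """Rough cost of a 2D FFT over an n x d matrix: O(n * d * log(n * d))."""
    size = n_tokens * n_features
    return size * math.log2(size)


# Illustrative binary spike matrix: 64 tokens, 32 features, about 20% firing rate.
rng = np.random.default_rng(0)
spikes = (rng.random((64, 32)) < 0.2).astype(np.float32)

mixed = fourier_mix(spikes)
print(mixed.shape)
print(attention_cost(1024, 64), fft_cost(1024, 64))
```

Because the transform is linear and fixed, every output element depends on every input spike without any trained weights, which is consistent with the memory savings the summary attributes to the absence of learnable parameters.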