Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models

Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, David Kappel, Anand Subramoney
2024-05-01
Abstract: Activity and parameter sparsity are two standard methods of making neural networks computationally more efficient. Event-based architectures such as spiking neural networks (SNNs) naturally exhibit activity sparsity, and many methods exist to sparsify their connectivity by pruning weights. While the effect of weight pruning on feed-forward SNNs has been previously studied for computer vision tasks, the effects of pruning for complex sequence tasks like language modeling are less well studied since SNNs have traditionally struggled to achieve meaningful performance on these tasks. Using a recently published SNN-like architecture that works well on small-scale language modeling, we study the effects of weight pruning when combined with activity sparsity. Specifically, we study the trade-off between the multiplicative efficiency gains the combination affords and its effect on task performance for language modeling. To dissect the effects of the two sparsities, we conduct a comparative analysis between densely activated models and sparsely activated event-based models across varying degrees of connectivity sparsity. We demonstrate that sparse activity and sparse connectivity complement each other without a proportional drop in task performance for an event-based neural network trained on the Penn Treebank and WikiText-2 language modeling datasets. Our results suggest sparsely connected event-based neural networks are promising candidates for effective and efficient sequence modeling.
Machine Learning, Artificial Intelligence, Neural and Evolutionary Computing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper explores the synergy between weight sparsity and activity sparsity in neuromorphic language models. Specifically, the authors focus on the following issues:

1. **Joint effects of sparse connections and sparse activities**:
   - Sparse connections reduce the amount of computation by pruning the weights of a neural network.
   - Sparse activities reduce the amount of communication between neurons through an event-driven mechanism.
   - The authors study the impact of these two types of sparsity on task performance in recurrent neural networks (RNNs), especially in language modeling tasks.
2. **Applications in complex sequence tasks**:
   - Although event-driven architectures such as spiking neural networks (SNNs) have been studied for computer vision tasks, they have traditionally struggled on complex sequence tasks such as language modeling.
   - The authors use a recently published SNN-like architecture to study the effects of sparse connections and sparse activities in language modeling.
3. **Trade-off between efficiency and performance**:
   - Study how the combination of sparse connections and sparse activities can improve computational efficiency without a proportional drop in task performance.
   - Compare the performance of the densely activated LSTM model and the sparsely activated EGRU model under different connection sparsities.
4. **Hardware efficiency**:
   - Explore how these sparsity techniques can achieve higher energy efficiency and lower latency on neuromorphic hardware, which is especially important for mobile devices and edge-computing systems.

### Main contributions

- **Empirical research**: Shows through experiments that sparse connections and sparse activities act independently of each other and combine synergistically in language modeling tasks.
- **Performance improvement**: Demonstrates that the combination of sparse connections and sparse activities can reduce the consumption of computational resources multiplicatively without a proportional drop in task performance.
- **Hardware adaptability**: Provides a basis for the future implementation of efficient language models on neuromorphic hardware.

### Key formulas

- **Output rule of EGRU**: a unit emits its state \( c^{(t)}_i \) only when that state reaches the threshold \( \theta_i \), and outputs 0 otherwise:
  \[ y^{(t)}_i = c^{(t)}_i H(c^{(t)}_i - \theta_i) \]
  where
  \[ H(x)=\begin{cases} 1, & x\geq0 \\ 0, & x < 0 \end{cases} \]
- **Calculation of update gate and reset gate**:
  \[ u^{(t)}=\sigma(W_u x^{(t)}+W_{uy}y^{(t - 1)}+b_u) \]
  \[ r^{(t)}=\sigma(W_r x^{(t)}+W_{ry}y^{(t - 1)}+b_r) \]
- **State update**:
  \[ z^{(t)}=g(W_z x^{(t)}+W_{zy}(r^{(t)}\odot y^{(t - 1)})+b_z) \]
  \[ c^{(t)}=u^{(t)}\odot z^{(t)}+(1 - u^{(t)})\odot c^{(t - 1)}-s^{(t)} \]
  where \( s^{(t)}=\theta H(c^{(t)}-\theta) \)

Through these studies, the authors provide insights for understanding and optimizing sparsity techniques in neuromorphic computing.
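
The EGRU formulas above can be sketched in plain Python to show where the two sparsities save work. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names `event_matvec`, `magnitude_prune`, `make_params`, and `egru_step` are hypothetical. The nonlinearity g is assumed to be tanh, pruning is assumed to be one-shot magnitude pruning, and the reset term s is assumed to be computed from the previous state c^{(t-1)} so that the recurrence can be evaluated explicitly.

```python
import math
import random

def sigmoid(a):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + math.tanh(0.5 * a))

def event_matvec(W, v):
    """Return (W @ v, MAC count), skipping zero inputs and zero weights."""
    out = [0.0] * len(W)
    macs = 0
    for j, vj in enumerate(v):
        if vj == 0.0:              # unit j emitted no event: skip its column
            continue
        for i, row in enumerate(W):
            if row[j] != 0.0:      # pruned connection: nothing to compute
                out[i] += row[j] * vj
                macs += 1
    return out, macs

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of the entries of W."""
    flat = sorted(abs(w) for row in W for w in row)
    k = int(sparsity * len(flat))
    if k == 0:
        return [row[:] for row in W]
    cutoff = flat[k - 1]
    return [[0.0 if abs(w) <= cutoff else w for w in row] for row in W]

def make_params(n, m, seed=0):
    """Random dense parameters for n hidden units and m inputs (illustrative)."""
    rng = random.Random(seed)
    p = {}
    for gate in "urz":
        p["W" + gate] = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]
        p["W" + gate + "y"] = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
        p["b" + gate] = [0.0] * n
    return p

def egru_step(x, c_prev, y_prev, p, theta):
    """One EGRU update; returns (c, y, MACs spent on recurrent products)."""
    macs = 0

    def pre(gate, y_in):
        nonlocal macs
        ax, _ = event_matvec(p["W" + gate], x)
        ay, used = event_matvec(p["W" + gate + "y"], y_in)
        macs += used
        return [a + b + bias for a, b, bias in zip(ax, ay, p["b" + gate])]

    u = [sigmoid(a) for a in pre("u", y_prev)]
    r = [sigmoid(a) for a in pre("r", y_prev)]
    z = [math.tanh(a) for a in pre("z", [ri * yi for ri, yi in zip(r, y_prev)])]
    # Assumption: the reset term is taken from the previous state c_prev.
    s = [th if cp >= th else 0.0 for cp, th in zip(c_prev, theta)]
    c = [ui * zi + (1 - ui) * cp - si for ui, zi, cp, si in zip(u, z, c_prev, s)]
    y = [ci if ci >= th else 0.0 for ci, th in zip(c, theta)]  # y = c * H(c - theta)
    return c, y, macs

n, m = 8, 4
theta = [0.5] * n
dense = make_params(n, m)
# Prune only the recurrent matrices (keys ending in "y") to 50% sparsity.
pruned = {k: magnitude_prune(v, 0.5) if k.endswith("y") else v
          for k, v in dense.items()}
x = [1.0] * m
c_prev = [0.9, 0.1, -0.2, 1.2, 0.3, 0.0, 0.4, -0.1]
y_prev = [ci if ci >= th else 0.0 for ci, th in zip(c_prev, theta)]  # 2 of 8 active
_, _, macs_dense = egru_step(x, c_prev, y_prev, dense, theta)
_, _, macs_pruned = egru_step(x, c_prev, y_prev, pruned, theta)
print("fully dense:", 3 * n * n, "activity-sparse:", macs_dense,
      "activity- and weight-sparse:", macs_pruned)
```

With dense weights and 2 of 8 units active, the three recurrent products need 3 × 8 × 2 = 48 multiply-accumulates rather than 3 × 8 × 8 = 192. Pruning the recurrent weights can only lower that count further, because each remaining operation requires both an active presynaptic unit and a surviving connection. This is the multiplicative effect the abstract refers to.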