Abstract: Stability arguments are often used to prevent learning algorithms from developing ever-increasing activity and weights that hinder generalization. However, stability conditions can clash with the sparsity required to improve the energy efficiency of spiking neurons. Nonetheless, they can also provide solutions. Spiking Neuromorphic Computing uses binary activity to improve the energy efficiency of Artificial Intelligence, but its non-smoothness requires approximate gradients, known as Surrogate Gradients (SG), to close the performance gap with Deep Learning. Several SG have been proposed in the literature, but it remains unclear how to determine the best SG for a given task and network. We therefore aim to define the best SG theoretically, through stability arguments, to reduce the need for grid search. We show that more complex tasks and networks need a more careful choice of SG, even if overall the derivative of the fast sigmoid tends to outperform the others for a wide range of learning rates. We then design a stability-based theoretical method to choose the initialization and the SG shape before training on the most common spiking neuron, the Leaky Integrate and Fire (LIF). Since our stability method suggests the use of high firing rates at initialization, which is non-standard in the neuromorphic literature, we show that high initial firing rates, combined with a sparsity-encouraging loss term introduced gradually, can lead to better generalization, depending on the SG shape. Our stability-based theoretical solution finds an SG and an initialization that experimentally result in improved accuracy. We show how it can be used to reduce the need for an extensive grid search over the dampening, sharpness and tail-fatness of the SG. We also show that our stability concepts can be extended to different LIF variants, such as DECOLLE and fluctuations-driven initializations.
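
To make the terms in the abstract concrete, the following is a minimal sketch of a discrete-time LIF neuron together with a surrogate gradient shaped like the derivative of the fast sigmoid. It is an illustration under stated assumptions, not the paper's implementation: the function names, the decay value, the reset-by-subtraction rule, and the exact way `dampening` and `sharpness` enter the formula are all hypothetical choices made here, and tail-fatness is not modeled.

```python
# Illustrative sketch only. Parameter names and the exact parameterization
# are assumptions, not taken from the paper.

def fast_sigmoid_sg(v, threshold=1.0, dampening=1.0, sharpness=1.0):
    """Surrogate gradient with the shape of the fast-sigmoid derivative.

    Assumed form: dampening / (1 + sharpness * |v - threshold|)**2.
    `dampening` scales the peak height, `sharpness` narrows the peak.
    """
    return dampening / (1.0 + sharpness * abs(v - threshold)) ** 2


def lif_step(v, x, decay=0.9, threshold=1.0):
    """One LIF update: leak, integrate input, spike, reset by subtraction.

    Returns the post-reset voltage, the binary spike, and the pre-reset
    voltage (the point at which the surrogate gradient is evaluated).
    """
    v_pre = decay * v + x
    spike = 1.0 if v_pre >= threshold else 0.0
    v_post = v_pre - spike * threshold
    return v_post, spike, v_pre


def simulate(inputs, decay=0.9, threshold=1.0):
    """Run the neuron over a list of inputs, recording spikes and SG values."""
    v, spikes, sgs = 0.0, [], []
    for x in inputs:
        v, spike, v_pre = lif_step(v, x, decay, threshold)
        spikes.append(spike)
        sgs.append(fast_sigmoid_sg(v_pre, threshold))
    return spikes, sgs


if __name__ == "__main__":
    spikes, sgs = simulate([0.6, 0.6, 0.6])
    print(spikes)
    print(sgs)
```

The forward pass emits a hard 0/1 spike, whose true derivative is zero almost everywhere; during training, the smooth value returned by `fast_sigmoid_sg` would stand in for that derivative. In this sketch, lowering `dampening` shrinks the gradient passed through each spike, and raising `sharpness` concentrates it around the threshold, which are two of the quantities the abstract says are usually tuned by grid search.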

Spike No More: Stabilizing the Pre-training of Large Language Models

Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes

Loss Spike in Training Neural Networks

Stable and low-precision training for large-scale vision-language models

Spike Trains Encoding and Threshold Rescaling Method for Deep Spiking Neural Networks

SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

SpikingMiniLM: Energy-Efficient Spiking Transformer for Natural Language Understanding

SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

Stable Language Model Pre-training by Reducing Embedding Variability

Gradient Scaling on Deep Spiking Neural Networks with Spike-Dependent Local Information

SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

To Spike or Not to Spike, that is the Question

High-performance deep spiking neural networks with 0.3 spikes per neuron

Methods of improving LLM training stability

Small-scale proxies for large-scale Transformer training instabilities

Always-Sparse Training by Growing Connections with Guided Stochastic Exploration

Stabilizing Spiking Neuron Training

Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks

Pipelined Backpropagation at Scale: Training Large Models without Batches

Stabilizing RNN Gradients through Pre-training