Abstract: Speech recognition (SR) has been improved significantly by artificial neural networks (ANNs), but ANNs suffer from biological implausibility and excessive power consumption because of the nonlocal transfer of real-valued errors and weights. Spiking neural networks (SNNs) have the potential to overcome these drawbacks through efficient spike-based communication and their natural ability to use the synaptic plasticity rules found in the brain for weight modification. However, existing SNN models for SR either performed poorly or were trained in biologically implausible ways. In this paper, we present a biologically inspired convolutional SNN model for SR. The network adopts the time-to-first-spike coding scheme for fast and efficient information processing. A biological learning rule, spike-timing-dependent plasticity (STDP), is used to adjust the synaptic weights of convolutional neurons to form receptive fields in an unsupervised way. In the convolutional structure, we introduce a local weight sharing strategy, which could lead to better feature extraction from speech signals than global weight sharing. We first evaluated the SNN model with a linear support vector machine (SVM) on the TIDIGITS dataset, where it achieved an accuracy of 97.5%, comparable to the best results of ANNs. Further analysis of the network outputs showed that they are not only more linearly separable but also have fewer dimensions and are sparse. To further confirm the validity of our model, we trained it on a more difficult recognition task based on the TIMIT dataset, where it achieved a high accuracy of 93.8%. Moreover, a linear spike-based classifier, the tempotron, can also achieve accuracies very close to those of the SVM on both tasks. These results demonstrate that an STDP-based convolutional SNN model equipped with local weight sharing and temporal coding is capable of solving the SR task accurately and efficiently.
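
The two mechanisms named in the abstract, time-to-first-spike coding and STDP, can be illustrated with a minimal sketch. It is not the paper's implementation: the linear intensity-to-latency mapping, the simplified multiplicative STDP rule, and every name and parameter below (`ttfs_encode`, `stdp_update`, `t_max`, `a_plus`, `a_minus`) are assumptions chosen for illustration.

```python
def ttfs_encode(intensities, t_max=100.0):
    """Time-to-first-spike coding (assumed linear mapping).

    Each intensity in [0, 1] becomes a single spike time:
    stronger inputs fire earlier, zero input never fires (None).
    """
    times = []
    for x in intensities:
        if x <= 0.0:
            times.append(None)  # silent input: no spike
        else:
            times.append(t_max * (1.0 - min(x, 1.0)))
    return times


def stdp_update(weights, pre_times, post_time, a_plus=0.05, a_minus=0.03):
    """One simplified STDP step for a single postsynaptic neuron.

    Assumed rule: a synapse is strengthened if its presynaptic spike
    arrived at or before the postsynaptic spike, otherwise weakened.
    The w * (1 - w) factor keeps weights inside [0, 1].
    """
    updated = []
    for w, t_pre in zip(weights, pre_times):
        if t_pre is not None and t_pre <= post_time:
            dw = a_plus * w * (1.0 - w)    # pre before post: potentiate
        else:
            dw = -a_minus * w * (1.0 - w)  # pre after post or silent: depress
        updated.append(min(1.0, max(0.0, w + dw)))
    return updated


# Hypothetical usage: three input channels, postsynaptic spike at t = 30.
spike_times = ttfs_encode([0.9, 0.2, 0.0])
new_weights = stdp_update([0.5, 0.5, 0.5], spike_times, post_time=30.0)
```

In this sketch only the first channel spikes before the postsynaptic neuron, so only its weight grows. Repeated over many inputs, updates of this kind are what let unsupervised receptive fields emerge.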

SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network

Towards Energy-Preserving Natural Language Understanding with Spiking Neural Networks

Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition

SpikingMiniLM: Energy-Efficient Spiking Transformer for Natural Language Understanding

Unsupervised speech recognition through spike-timing-dependent plasticity in a convolutional spiking neural network

Spike Trains Encoding and Threshold Rescaling Method for Deep Spiking Neural Networks

Spike-based Encoding and Learning of Spectrum Features for Robust Sound Recognition

Spiking Deep Residual Networks

Spiking Convolutional Neural Networks for Text Classification

An Efficient and Perceptually Motivated Auditory Neural Encoding and Decoding Algorithm for Spiking Neural Networks

Spike Trains Encoding Optimization for Spiking Neural Networks Implementation in FPGA

Efficient Speech Command Recognition Leveraging Spiking Neural Network and Curriculum Learning-based Knowledge Distillation

Spikeformer: Training high-performance spiking neural network with transformer

sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network

Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training

Spiking Structured State Space Model for Monaural Speech Enhancement

SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement

Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing