Bipolar Population Threshold Encoding for Audio Recognition with Deep Spiking Neural Networks

Xiaocui Lin, Jiangrong Shen, Jun Wen, Huajin Tang
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191235
2023-01-01
Abstract: Spiking Neural Networks (SNNs) have been increasingly investigated for audio recognition because, by mimicking biological neural systems, they achieve low power consumption on neuromorphic hardware. Since SNNs learn from spikes, a critical step is the efficient neural encoding of real-valued sound signals to represent the complex temporal patterns in speech and environmental sounds. In this paper, we propose a novel Bipolar Population Threshold (BPT) encoding model that effectively captures the trajectory information of time-series speech data by combining temporal and spatial dimensions. The bipolar encoding uses positive and negative neurons to capture dynamic changes in the audio signal, while the threshold intervals yield a sparse representation that encodes only significant changes, resulting in an efficient and simplified recognition process. Extensive experiments on three benchmark datasets, TIDIGITS (speech), RWCP (sounds), and MedleyDB (music), show that the proposed method consistently outperforms state-of-the-art approaches while using fewer spikes, especially in capturing the complex spatiotemporal patterns of audio signals.
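The abstract describes the encoder only at a high level. As a rough illustration of the idea, here is a minimal sketch of a generic bipolar threshold-crossing encoder for a single 1-D feature channel. It assumes evenly spaced thresholds and assumes that a positive (negative) neuron fires when the signal crosses its threshold upward (downward) between consecutive samples. The function names `make_thresholds` and `bipolar_threshold_encode`, the threshold placement, and the crossing rule are hypothetical choices for illustration, not the authors' BPT model.

```python
from typing import List, Sequence


def make_thresholds(lo: float, hi: float, n: int) -> List[float]:
    """Hypothetical helper: n evenly spaced threshold levels strictly inside (lo, hi)."""
    step = (hi - lo) / (n + 1)
    return [lo + step * (k + 1) for k in range(n)]


def bipolar_threshold_encode(signal: Sequence[float],
                             thresholds: Sequence[float]) -> List[List[int]]:
    """Sketch of a bipolar population threshold-crossing encoder (an assumption,
    not the paper's exact method).

    Returns a spike matrix with 2 * len(thresholds) rows and len(signal) columns:
      row k                   -> positive neuron for thresholds[k], fires at t when
                                 the signal crosses the threshold upward (t-1 -> t)
      row len(thresholds) + k -> negative neuron for thresholds[k], fires at t when
                                 the signal crosses the threshold downward
    A signal that stays between two thresholds produces no spikes (sparsity).
    """
    n_th = len(thresholds)
    n_steps = len(signal)
    spikes = [[0] * n_steps for _ in range(2 * n_th)]
    for t in range(1, n_steps):
        prev, cur = signal[t - 1], signal[t]
        for k, th in enumerate(thresholds):
            if prev < th <= cur:          # upward crossing -> positive neuron
                spikes[k][t] = 1
            elif prev >= th > cur:        # downward crossing -> negative neuron
                spikes[n_th + k][t] = 1
    return spikes


if __name__ == "__main__":
    th = make_thresholds(0.0, 1.0, 3)                     # [0.25, 0.5, 0.75]
    s = bipolar_threshold_encode([0.0, 0.5, 1.0, 0.5, 0.0], th)
    print(sum(map(sum, s)))                               # prints 6
```

In this sketch the neuron index plays the role of the "spatial" (population) dimension and the sample index the temporal one; a rising-then-falling segment activates positive neurons on the way up and negative neurons on the way down, while a flat segment emits nothing.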