A Dynamic Convolution-Transformer Neural Network for Multiple Sound Source Localization Based on Functional Beamforming
Ge Zhang, Lin Geng, Feng Xie, Chun-Dong He
DOI: https://doi.org/10.1016/j.ymssp.2024.111272
IF: 8.4
2024-01-01
Mechanical Systems and Signal Processing
Abstract: Deep learning has achieved a major breakthrough in sound source localization and has overcome limitations of conventional model-based approaches. Deep learning-based methods aim to obtain a clear sound source distribution map from the signals collected by a microphone array so that sound sources can be identified accurately. Their performance depends strongly on the choice of input features and on the network architecture. Accordingly, a novel dynamic convolution-Transformer neural network (DYCTNN) is proposed to estimate the number, positions, and strengths of multiple sound sources precisely and with high resolution. In this paper, the functional beamforming (FB) map is first used as the network input. Then, a target function is tailored to generate the corresponding target map, which contains the position and strength information of the multiple sound sources and serves as the ground truth for training the network. Finally, the dynamic convolution-Transformer neural network, which has an encoder-decoder structure, is constructed for the multiple sound source localization task. The proposed DYCTNN combines the advantages of dynamic convolution and the Transformer. The dynamic convolution adaptively aggregates multiple convolutional kernels according to the input map in order to fully extract local features, which improves the representation capability of the network when predicting different spatial distribution characteristics of sound sources. In addition, the self-attention mechanism of the Transformer extracts global information about the spatial distribution of sound sources from the FB map to improve localization accuracy, and the depth-wise convolution reduces the number of network parameters and accelerates training. A numerical simulation with a 60-channel spiral microphone array is conducted to evaluate the localization capability of the proposed DYCTNN.
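To make the network input concrete, the following is a minimal NumPy sketch of a functional beamforming map, using the commonly cited form b(g) = (ĝ^H C^(1/ν) ĝ)^ν with unit-norm steering vectors ĝ and cross-spectral matrix C. The abstract does not give the exponent, the array geometry, or the steering model, so the exponent `nu=8.0`, the helper `spiral_array`, the free-field monopole steering in `free_field_steering`, and the eigenvalue threshold are all assumptions made for illustration, not the authors' settings.

```python
import numpy as np

def spiral_array(n_mics=60, radius=0.5, turns=3.0):
    """Hypothetical planar spiral array in the z = 0 plane (geometry is assumed)."""
    t = np.linspace(0.05, 1.0, n_mics)
    ang = 2.0 * np.pi * turns * t
    return np.stack([radius * t * np.cos(ang),
                     radius * t * np.sin(ang),
                     np.zeros(n_mics)], axis=1)

def free_field_steering(mics, grid, freq, c=343.0):
    """Free-field monopole transfer functions (assumed model), shape (G, M)."""
    r = np.linalg.norm(grid[:, None, :] - mics[None, :, :], axis=2)
    k = 2.0 * np.pi * freq / c
    return np.exp(-1j * k * r) / r

def functional_beamforming_map(csm, steering, nu=8.0):
    """Evaluate (g^H C^(1/nu) g)^nu for every grid point.

    csm:      (M, M) Hermitian cross-spectral matrix
    steering: (G, M) one steering vector per grid point
    """
    # Unit-norm steering vectors, so a single source returns its own power at the peak.
    g = steering / np.linalg.norm(steering, axis=1, keepdims=True)
    # Eigendecomposition C = U diag(w) U^H; C^(1/nu) = U diag(w^(1/nu)) U^H.
    w, U = np.linalg.eigh(csm)
    # Numerical safeguard (assumption): drop eigenvalues at round-off level,
    # because the 1/nu power would otherwise amplify them.
    w = np.where(w > 1e-12 * w.max(), w, 0.0)
    proj = np.abs(g.conj() @ U) ** 2          # |g^H u_k|^2, shape (G, M)
    inner = proj @ (w ** (1.0 / nu))          # g^H C^(1/nu) g
    return inner ** nu
```

Setting `nu=1.0` reduces the function to conventional delay-and-sum beamforming with the same steering vectors, which gives a convenient baseline for checking how much the larger exponent suppresses sidelobes.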
The quantitative and qualitative comparisons first show that the proposed DYCTNN predicts the number, positions, and strengths of multiple sound sources well, and that its localization accuracy is better than that of the current deep learning-based method. In addition, the robustness and generalization of the proposed DYCTNN are demonstrated by comparison with the current deep learning-based method under several unseen acoustic conditions, including signal-to-noise ratio, measurement distance, frequency, number of sound sources, and number of microphones. Finally, an ablation study verifies the effectiveness of the dynamic convolution and the self-attention mechanism in the proposed DYCTNN for multiple sound source localization.
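The kernel aggregation that the abstract attributes to dynamic convolution can be sketched in NumPy as follows. This is a generic illustration of input-dependent kernel mixing, not the DYCTNN layer itself: the attention branch (global average pooling, two linear maps with a ReLU, and a softmax), the `temperature` parameter, and the function names `conv2d_same`, `kernel_attention`, and `dynamic_conv2d` are assumptions, since the abstract does not specify them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, weight):
    """Zero-padded cross-correlation.

    x:      (C_in, H, W)
    weight: (C_out, C_in, kh, kw) with odd kh and kw
    """
    kh, kw = weight.shape[2:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    win = sliding_window_view(xp, (kh, kw), axis=(1, 2))  # (C_in, H, W, kh, kw)
    return np.einsum("chwij,ocij->ohw", win, weight)

def kernel_attention(x, w1, w2, temperature=1.0):
    """Input-dependent mixing weights over K kernels (assumed attention branch)."""
    pooled = x.mean(axis=(1, 2))               # global average pooling, (C_in,)
    hidden = np.maximum(w1 @ pooled, 0.0)      # linear map followed by ReLU
    logits = (w2 @ hidden) / temperature       # one logit per kernel, (K,)
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

def dynamic_conv2d(x, kernels, w1, w2, temperature=1.0):
    """Aggregate K kernels with attention weights, then convolve once.

    kernels: (K, C_out, C_in, kh, kw)
    Returns the output map and the attention weights.
    """
    pi = kernel_attention(x, w1, w2, temperature)
    weight = np.tensordot(pi, kernels, axes=1)  # weighted sum over the K kernels
    return conv2d_same(x, weight), pi
```

Because convolution is linear in the kernel, convolving once with the aggregated kernel gives the same result as mixing the K separate convolution outputs with the same weights, but at roughly the cost of a single convolution.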