A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network

Di Shi, Min Guo, Miao Ma
DOI: https://doi.org/10.1007/s00530-024-01582-8
IF: 3.9
2024-11-28
Multimedia Systems
Abstract: Sound event localization and detection (SELD) systems can provide intelligent sound processing and analysis functions for a variety of application devices. However, most existing deep learning-based networks simply concatenate convolutional neural networks (CNNs) and recurrent neural networks, which loses key feature information in the audio and makes accurate localization and detection more difficult. In this paper, we propose a local–global adaptive fusion and temporal importance network model. First, a CNN block and a multi-scale enhanced axial cross attention Transformer block learn the local and global features, respectively. Then, an adaptive fusion module fuses the local and global features. Finally, a positional attention temporal context module explores the positional information in the sound temporal sequence to capture the important features. Experimental results on the Sony-TAU Realistic Spatial Soundscapes 2022 dataset and the synthetic dataset show that the error rate (ER_20°) and localization error (LE_CD) of the proposed model are reduced to 0.65 and 22.3°, respectively, the F-score (F_20°) and localization recall (LR_CD) are increased to 31.1% and 54.8%, respectively, and the comprehensive evaluation metric, ε_SELD, is reduced to 0.48, outperforming the other models compared.
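The abstract does not state how the comprehensive evaluation metric is computed. As a rough consistency check, the sketch below assumes the aggregate definition commonly used in DCASE-style SELD evaluation: the mean of the error rate, one minus the F-score, the localization error normalized by 180°, and one minus the localization recall. The function name `seld_score` and its parameter names are hypothetical, and the formula is an assumption rather than something confirmed by this abstract.

```python
def seld_score(er: float, f_score: float, le_deg: float, lr: float) -> float:
    """Aggregate SELD score under the assumed DCASE-style definition.

    er      : location-dependent error rate (lower is better)
    f_score : location-dependent F-score as a fraction in [0, 1]
    le_deg  : class-dependent localization error in degrees
    lr      : class-dependent localization recall as a fraction in [0, 1]
    """
    # Each term is mapped so that 0 is best and roughly 1 is worst,
    # then the four terms are averaged with equal weight.
    return (er + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr)) / 4.0


# Values reported in the abstract, with percentages converted to fractions.
score = seld_score(er=0.65, f_score=0.311, le_deg=22.3, lr=0.548)
print(round(score, 2))  # 0.48
```

Under this assumed definition, the four reported metrics reproduce the reported aggregate value of 0.48.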
computer science, information systems, theory & methods