A Polyphonic SELD Network Based on Attentive Feature Fusion and Multi-stage Training Strategy

Yin Xie, Jie Liu, Ying Hu
DOI: https://doi.org/10.1109/AINIT59027.2023.10212904
2023-01-01
Abstract: Sound event localization and detection (SELD) aims to detect the types and boundaries of sound events and to estimate the direction of arrival of the corresponding sound sources. This paper proposes a SELD network based on attentive feature fusion (AFF-SELD). An attentive feature fusion (AFF) unit is designed to learn features related to sound sources from channel-wise statistics. Moreover, we adopt a multi-stage training strategy (MSTS) to strengthen the feature extractors at each stage. We use a three-stage data augmentation approach as a data transformation mechanism to regularize the network across various acoustic environments. Experimental results demonstrate the effectiveness of the AFF unit, the MSTS, and especially the three-stage data augmentation approach. On the TNSSE2020 dataset, for example, the proposed AFF-SELD network outperforms the state-of-the-art method by 10.8% and 19.5% on the F≤20° metric without and with data augmentation, respectively.
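The location-dependent F-score with a 20° threshold, on which the abstract reports its gains, counts a detection as correct only when the predicted class matches a reference event and the angular error between predicted and reference directions is at most 20°. The following is a minimal sketch of that idea for a single time segment, not the evaluation code used in the paper. The function names, the dictionary layout, and the one-event-per-class simplification are assumptions made for illustration.

```python
import math


def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions,
    each given as (azimuth, elevation) in degrees."""
    az1, el1, az2, el2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def location_dependent_f_score(references, predictions, threshold_deg=20.0):
    """Simplified F-score for one segment (illustrative assumption:
    at most one active event per class).

    references, predictions: dict mapping class label -> (azimuth, elevation).
    A prediction is a true positive only if its class is in the references
    and its angular error is within threshold_deg.
    """
    tp = 0
    for label, (az, el) in predictions.items():
        if label in references:
            ref_az, ref_el = references[label]
            if angular_error_deg(az, el, ref_az, ref_el) <= threshold_deg:
                tp += 1
    fp = len(predictions) - tp  # wrong class or too far from the reference
    fn = len(references) - tp   # reference events not correctly localized
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: "speech" is localized within 20 degrees, "dog" is not.
refs = {"speech": (10.0, 0.0), "dog": (90.0, 10.0)}
preds = {"speech": (20.0, 5.0), "dog": (150.0, 10.0)}
score = location_dependent_f_score(refs, preds)
```

In this sketch a correctly classified but poorly localized event is penalized as both a false positive and a false negative, which is why localization quality directly affects the detection score.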