U Recurrent Neural Network for Polyphonic Sound Event Detection and Localization

Lihong Pi, Xue Zheng, Chun Zhang, Ping Chen, Zhe Wang, Xiangyu Li
DOI: https://doi.org/10.1145/3404716.3404726
2020-01-01
Abstract: A polyphonic sound event detection and localization (SELD) system identifies the onset and offset times of the sound events to be detected and tracks the spatial location of each acoustic source. It involves two processes: sound event detection (SED) and direction-of-arrival (DOA) estimation. Previous models extract features by simply stacking convolutional layers, which leads to two problems. First, the network is difficult to deepen, so the expressive capability of the model is limited. Second, these models use only high-level features and lack a description of the low-level texture features of the sound signal. In this paper, a novel model called the U recurrent neural network (URNN) is proposed to alleviate these problems. It combines low-level and high-level features without significantly increasing computation cost, and exploits the identity layer to make the network deeper. The proposed method is evaluated on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3 dataset [1], which covers distinct overlapping sound events collected from different environments. Experimental results show that the proposed URNN significantly reduces the SELD error, by 16.2% compared to the baseline model SELDnet [2] and by 2.5% compared to the improved convolutional recurrent neural network (CRNN).
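The two ideas named in the abstract, combining low-level with high-level features and using identity connections to ease deepening, can be illustrated with a toy sketch. This is not the URNN architecture itself, which the abstract does not specify. The function names, the pair-averaging "pooling", the fixed scaling transforms, and the even-length input are all assumptions made for illustration.

```python
# Toy illustration only: NOT the URNN architecture from the paper.
# All names and numeric choices below are hypothetical.

def downsample(x):
    """Halve the time resolution by averaging adjacent pairs (stand-in for pooling)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x):
    """Double the time resolution by repeating each value."""
    return [v for v in x for _ in range(2)]


def identity_block(x, transform):
    """Residual-style block: the input is added back to the transformed output."""
    return [a + b for a, b in zip(x, transform(x))]


def u_shaped_features(signal):
    """Return one (low_level, high_level) feature pair per input frame.

    Assumes an even-length input so the upsampled path matches the original length.
    """
    # Full-resolution (low-level) features with an identity shortcut.
    low = identity_block(signal, lambda x: [0.5 * v for v in x])
    # Coarser (high-level) features computed on the downsampled sequence.
    high = identity_block(downsample(low), lambda x: [0.25 * v for v in x])
    # Skip connection: pair low-level detail with the upsampled high-level summary.
    return list(zip(low, upsample(high)))


features = u_shaped_features([1.0, 3.0, 2.0, 6.0])
```

Each element of `features` carries both a fine-grained value and a coarse summary for the same frame, which is the sense in which a U-shaped design keeps low-level information available to later layers.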