A Model Ensemble Approach for Sound Event Localization and Detection.

Qing Wang, Huaxin Wu, Zijun Jing, Feng Ma, Yi Fang, Yuxuan Wang, Tairan Chen, Jia Pan, Jun Du, Chin-Hui Lee
DOI: https://doi.org/10.1109/iscslp49672.2021.9362116
2021-01-01
Abstract: In this paper, we propose a model ensemble approach for sound event localization and detection (SELD). We adopt several deep neural network (DNN) architectures to perform sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously. Each DNN architecture generally consists of three stacked modules: a High-level Feature Representation module, a Temporal Context Representation module, and a final Fully-connected module. The High-level Feature Representation module usually contains a series of convolutional neural network (CNN) layers that extract useful local features. The Temporal Context Representation module models longer-range temporal dependencies in the extracted features. The Fully-connected module has two parallel branches, one for SED estimation and the other for DOA estimation. Combining different implementations of the High-level Feature Representation and Temporal Context Representation modules yields several network architectures for the SELD task. Finally, a more robust prediction of SED and DOA is obtained through model ensembling and post-processing. Tested on the development and evaluation datasets, the proposed approach achieves promising results and ranked first in the DCASE 2020 Task 3 challenge.
Index Terms: sound event localization and detection, deep neural network, model ensemble
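The abstract does not say how the individual model outputs are combined, so the following is only a minimal sketch of one plausible ensembling step, assuming simple averaging: per-class SED probabilities are averaged across models and thresholded, and per-class DOA vectors are averaged and renormalized to unit length. The function name `ensemble_frame`, the 0.5 threshold, and the Cartesian (x, y, z) DOA representation are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def ensemble_frame(
    sed_probs: Sequence[Sequence[float]],
    doa_vectors: Sequence[Sequence[Vec3]],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Tuple[List[bool], List[Vec3]]:
    """Combine the outputs of several models for one time frame.

    sed_probs[m][c]   : model m's activity probability for class c
    doa_vectors[m][c] : model m's (x, y, z) DOA estimate for class c
    """
    n_models = len(sed_probs)
    n_classes = len(sed_probs[0])
    active: List[bool] = []
    doa: List[Vec3] = []
    for c in range(n_classes):
        # SED: average the class probability over models, then threshold.
        mean_prob = sum(sed_probs[m][c] for m in range(n_models)) / n_models
        active.append(mean_prob > threshold)
        # DOA: average the direction vectors, then rescale to unit length.
        mean_vec = [
            sum(doa_vectors[m][c][k] for m in range(n_models)) / n_models
            for k in range(3)
        ]
        norm = math.sqrt(sum(v * v for v in mean_vec))
        if norm > 0.0:
            doa.append((mean_vec[0] / norm, mean_vec[1] / norm, mean_vec[2] / norm))
        else:
            # Opposing estimates cancelled out; no direction can be recovered.
            doa.append((0.0, 0.0, 0.0))
    return active, doa


# Two hypothetical models, two sound classes, one frame.
sed = [[0.9, 0.2], [0.7, 0.4]]
doas = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]]
active, doa = ensemble_frame(sed, doas)
```

In this toy input only the first class passes the threshold, and its two disagreeing direction estimates are merged into a single unit vector halfway between them. A real system might weight models unequally or smooth the predictions over time in post-processing.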