Abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3-D) space. The proposed network takes a sequence of consecutive spectrogram time frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label classification task on each time frame, producing the temporal activity of all the sound event classes. As the second output, localization is performed by estimating the 3-D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with their respective sound event labels and to track this association over time. The proposed method uses the phase and magnitude components of the spectrogram, calculated separately on each audio channel, as features, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant, and real-life scenarios. The proposed method is compared with two SED baselines, three DOA estimation baselines, and one SELD baseline. The results show that the proposed method is generic and applicable to any array structure, and robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than that of the best baseline method for a higher number of overlapping sound events.
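
The input features and the two parallel outputs described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the azimuth/elevation convention (degrees, elevation measured from the horizontal plane), and the 0.5 activity threshold are assumptions made for illustration, and the network itself is left out. The sketch only shows how per-channel phase and magnitude could be stacked for one time frame, how a DOA maps to and from 3-D Cartesian coordinates, and how a per-class activity vector could be paired with per-class DOA estimates.

```python
import cmath
import math


def phase_magnitude_features(stft_frame):
    """Stack per-channel magnitudes, then per-channel phases, for one time frame.

    stft_frame: list of channels, each a list of complex STFT bins (hypothetical layout).
    """
    magnitudes = [[abs(c) for c in channel] for channel in stft_frame]
    phases = [[cmath.phase(c) for c in channel] for channel in stft_frame]
    return magnitudes + phases


def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Map a DOA (assumed convention: degrees) to a point on the unit sphere."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def cartesian_to_doa(x, y, z):
    """Inverse mapping; does not require (x, y, z) to be exactly unit length."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation


def decode_frame(sed_probs, doa_xyz, threshold=0.5):
    """Pair each active class with its DOA estimate for one time frame.

    sed_probs: per-class activity probabilities (the multi-label SED output).
    doa_xyz:   per-class (x, y, z) regression outputs.
    threshold: assumed decision threshold, not taken from the abstract.
    """
    return {
        k: cartesian_to_doa(*doa_xyz[k])
        for k, p in enumerate(sed_probs)
        if p > threshold
    }
```

Because each class carries its own (x, y, z) output, the class index links a detected event to its direction in every frame, which is how the association between labels and DOAs can be followed over time.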

Early Detection of Continuous and Partial Audio Events Using CNN

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection.

Robust sound event classification using deep neural networks

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

Robust Audio Sensing with Multi-Sound Classification.

Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

Sound event localization and detection based on crnn using rectangular filters and channel rotation data augmentation

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

Dilated-Gated Convolutional Neural Network with A New Loss Function on Sound Event Detection.

Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection

Conditioned Time-Dilated Convolutions for Sound Event Detection

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

Multi-Scale Convolution Based Attention Network for Semi-Supervised Sound Event Detection (Technical Report)

Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

A novel hybrid ensemble approach to enhance the acoustic event classification in environmental sound analysis

Double Mixture: Towards Continual Event Detection from Speech

Sound event detection via dilated convolutional recurrent neural networks

Sound Event Detection for Human Safety and Security in Noisy Environments