Abstract: The objective of sound event localization and detection (SELD) is to identify when sound events of specific categories occur and where they are located in space. Mainstream offline methods may unintentionally introduce unfavorable future feature information during training, which can hinder system performance. Online methods can improve localization accuracy to some extent, but they may weaken the detection of sound events. In this paper, we propose a hybrid offline-online method (HOOM) that extracts comprehensive audio information using offline network layers while filtering out irrelevant future information using online network layers. Based on this method, we design two simple sub-network architectures. The first, the convolution and causal convolution alternating network (CCAN), uses regular convolutions and causal convolutions to obtain offline and online convolutional features, respectively. The second, the bidirectional and unidirectional alternating network (BUAN), combines bidirectional and unidirectional recurrent neural networks to capture offline and online contextual sequence information, respectively. The proposed method achieves a 6% improvement in localization recall on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset, and a 4% improvement in overall performance compared with purely offline or purely online methods. On the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) synthetic dataset, the overall performance improvement is 5%. These results indicate a significant advantage and provide a novel and robust solution for the SELD task.
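To make the offline/online distinction concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It contrasts a centred ("offline") 1-D convolution, whose output at frame t depends on future frames, with a causal ("online") convolution, which sees only the current and past frames, and then alternates the two in the spirit of CCAN. The function names, the toy input, the 3-tap kernel, and the number of alternating pairs are all illustrative assumptions.

```python
def conv1d_offline(x, w):
    """Centred 'same' convolution (cross-correlation, as in most deep
    learning libraries): output t sees both past and future frames."""
    k = len(w)
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * (k - 1 - half)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def conv1d_causal(x, w):
    """Causal convolution: left padding only, so output t depends
    solely on frames t-k+1 .. t and never on future frames."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def alternating_stack(x, w, n_pairs=2):
    """Hypothetical stack alternating offline and causal layers.
    The depth and the shared kernel are assumptions for illustration."""
    for _ in range(n_pairs):
        x = conv1d_offline(x, w)
        x = conv1d_causal(x, w)
    return x


# Toy input (assumed): 8 frames of one feature and a 3-tap averaging kernel.
frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1 / 3, 1 / 3, 1 / 3]

# Perturb only the final frame, then inspect the output one frame earlier.
perturbed = frames[:-1] + [100.0]
t = 6
offline_changed = conv1d_offline(frames, kernel)[t] != conv1d_offline(perturbed, kernel)[t]
causal_changed = conv1d_causal(frames, kernel)[t] != conv1d_causal(perturbed, kernel)[t]
print(offline_changed, causal_changed)  # True False
```

The offline layer's output at frame 6 changes when frame 7 is perturbed, while the causal layer's output does not. The same contrast applies to BUAN: a bidirectional recurrent layer reads the sequence in both directions, whereas a unidirectional layer reads only past frames.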

A hybrid offline-online method for sound event localization and detection

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

Decoupling Temporal Convolutional Networks Model in Sound Event Detection and Localization

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection

Sound source detection, localization and classification using consecutive ensemble of CRNN models

Multi-scale Convolutional Recurrent Neural Network and Data Augmentation for Polyphonic Sound Event Detection

MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection.

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

Polyphonic sound event localization and detection using channel-wise FusionNet

Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection