Abstract: Sound event localization and detection (SELD) aims to accurately identify the temporal occurrence and spatial coordinates of sound events of a specific category. Mainstream offline methods may unintentionally introduce unfavorable future feature information during training, which can hinder system performance. Online methods can improve localization accuracy to some extent, but they may weaken the ability to detect sound events. In this paper, we propose a hybrid offline-online method (HOOM) that extracts comprehensive audio information using offline network layers while filtering out irrelevant future information using online network layers. Based on this method, we design two simple sub-network architectures. The first, the convolution and causal convolution alternating network (CCAN), employs regular convolutions and causal convolutions to obtain offline and online convolutional features, respectively. The second, the bidirectional and unidirectional alternating network (BUAN), combines bidirectional and unidirectional recurrent neural networks to capture offline and online contextual sequence information, respectively. The proposed method demonstrates a 6% improvement in localization recall on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset. Furthermore, compared to offline or online methods, there is a 4% overall performance improvement. On the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) synthetic dataset, the overall performance improvement is 5%. These results indicate a significant advantage and suggest that HOOM offers a novel and robust solution for the SELD task.
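
The distinction between offline (regular) and online (causal) convolution that the abstract relies on can be illustrated with a minimal pure-Python sketch. This is not the authors' CCAN implementation: the function names `conv1d` and `alternating_stack`, the single-channel input, the zero padding, and the absence of nonlinearities are all assumptions made for illustration. A regular convolution pads symmetrically, so each output frame sees future frames; a causal convolution pads only on the left, so each output frame depends on current and past frames only.

```python
def conv1d(x, kernel, causal):
    """1-D convolution over time that keeps the output length equal to len(x).

    causal=False (offline): symmetric zero padding, so output[t] depends on
    both past and future frames.
    causal=True (online): left-only zero padding, so output[t] depends only
    on frames at or before t.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = 0 if causal else k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]


def alternating_stack(x, kernels):
    """Hypothetical alternation: even-indexed layers are offline (regular),
    odd-indexed layers are online (causal)."""
    for idx, kernel in enumerate(kernels):
        x = conv1d(x, kernel, causal=(idx % 2 == 1))
    return x


# Two signals that differ only in the last (future) frame.
frames_a = [1.0, 2.0, 3.0, 4.0, 5.0]
frames_b = [1.0, 2.0, 3.0, 4.0, 50.0]
kernel = [1.0, 1.0, 1.0]

online_a = conv1d(frames_a, kernel, causal=True)
online_b = conv1d(frames_b, kernel, causal=True)
offline_a = conv1d(frames_a, kernel, causal=False)
offline_b = conv1d(frames_b, kernel, causal=False)
```

In this sketch, the causal outputs for the first four frames are identical for both signals, whereas the regular convolution's output at the fourth frame changes because it looks one frame ahead. Note that a stack alternating the two layer types is not causal as a whole; only the online layers restrict access to future frames.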

A Polyphonic SELD Network Based on Attentive Feature Fusion and Multi-stage Training Strategy

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Polyphonic sound event localization and detection using channel-wise FusionNet

MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

ICASSP 2022 L3DAS22 Challenge: Ensemble of Resnet-Conformers with Ambisonics Data Augmentation for Sound Event Localization and Detection

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection

PSELDNets: Pre-trained Neural Networks on Large-scale Synthetic Datasets for Sound Event Localization and Detection

Data Augmentation and Prior Knowledge-Based Regularization for Sound Event Localization and Detection Technical Report

Automated Audio Data Augmentation Network Using Bi-Level Optimization for Sound Event Localization and Detection

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy

SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

A hybrid offline-online method for sound event localization and detection

Improving Sound Event Localization and Detection with Class-Dependent Sound Separation for Real-World Scenarios

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

GLFER-Net: a Polyphonic Sound Source Localization and Detection Network Based on Global-Local Feature Extraction and Recalibration

A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization