Abstract: Sound Event Detection and Localization (SELD) is a joint task that addresses two subtasks simultaneously: Sound Event Detection (SED) and Sound Source Localization (SSL). The difficulty of SELD lies in the need to solve both sound recognition and spatial localization. In addition, sound events of different categories may overlap in time and space, which makes it harder for a model to distinguish events occurring at the same time and to locate their sources. In this study, we propose the Dual-conv Coordinate Attention Module (DCAM), which combines dual convolutional blocks with Coordinate Attention. Building on DCAM, we improve a network architecture based on the two-stage strategy to form the Two-Stage Dual-conv Coordinate Attention Model (TDCAM) for SELD. TDCAM draws on the concepts of Visual Geometry Group (VGG) networks and Coordinate Attention: it captures critical local information by attending to the coordinate space information of the feature map and by modeling the relationships between feature-map channels, which enhances the model's feature selection capability. To address the limited temporal modeling capacity of the single-layer Bi-directional Gated Recurrent Unit (Bi-GRU) in the two-stage network, we adopt a two-layer Bi-GRU structure and introduce frequency-mask and time-mask data augmentation to improve the model's temporal modeling and generalization ability. In experiments on the TAU Spatial Sound Events 2019 development dataset, our approach significantly improves SELD performance compared with the two-stage baseline model. Ablation experiments further confirm the effectiveness of DCAM and the two-layer Bi-GRU structure.
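
The frequency-mask and time-mask augmentation mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the mask widths, the number of masks, or the feature layout, so the function name `mask_spectrogram`, the parameters `max_freq_width` and `max_time_width`, the single mask per axis, and the 2-D `(freq_bins, time_frames)` input are all assumptions made for illustration, not the authors' settings.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width=8, max_time_width=16, rng=None):
    """Zero one random frequency band and one random time span.

    `spec` is assumed to be a 2-D array shaped (freq_bins, time_frames).
    The widths are hypothetical defaults, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the caller's array untouched
    n_freq, n_time = out.shape

    # Frequency mask: pick a width f, then a start bin f0, and zero the band.
    f = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    out[f0:f0 + f, :] = 0.0

    # Time mask: pick a width t, then a start frame t0, and zero the span.
    t = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    out[:, t0:t0 + t] = 0.0

    return out


if __name__ == "__main__":
    features = np.ones((64, 100), dtype=np.float32)
    augmented = mask_spectrogram(features, rng=np.random.default_rng(0))
    print(augmented.shape, int((augmented == 0).sum()), "masked cells")
```

Masking is typically applied only during training, with a fresh random mask per example. For multichannel SELD features, one plausible choice is to apply the same mask to every channel so that inter-channel cues stay aligned, but the abstract does not say how the authors handle this.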

Decoupling Temporal Convolutional Networks Model in Sound Event Detection and Localization

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy

A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

Sound Event Localization and Detection Based on Iterative Separation in Embedding Space

A Model Ensemble Approach for Sound Event Localization and Detection

Conditioned Time-Dilated Convolutions for Sound Event Detection

On Local Temporal Embedding for Semi-Supervised Sound Event Detection

Combined Sound Event Detection and Sound Event Separation Networks for DCASE 2020 Task 4 Technical Report

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning

U Recurrent Neural Network for Polyphonic Sound Event Detection and Localization

Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

Auditory Neural Response Inspired Sound Event Detection Based on Spectro-temporal Receptive Field

Improving Sound Event Localization and Detection with Class-Dependent Sound Separation for Real-World Scenarios

Polyphonic sound event localization and detection using channel-wise FusionNet

Hierarchical-Concatenate Fusion TDNN for sound event classification

Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains