Abstract: Sound Event Detection and Localization (SELD) is a joint task that solves the subtasks of Sound Event Detection (SED) and Sound Source Localization (SSL) simultaneously. The difficulty of SELD lies in the need to solve both sound recognition and spatial localization, and sound events of different categories may overlap in time and space, which makes it harder for a model to distinguish simultaneous events and to locate their sources. In this study, the Dual-conv Coordinate Attention Module (DCAM) combines dual convolutional blocks with Coordinate Attention; building on it, we improve a network architecture based on the two-stage strategy to form the Two-Stage Dual-conv Coordinate Attention Model (TDCAM) for SELD. TDCAM draws on the concepts of Visual Geometry Group (VGG) networks and Coordinate Attention: it captures critical local information by attending to the coordinate (spatial) information of the feature map and by modeling the relationships between feature map channels, which enhances the model's feature selection capability. To address the limited temporal modeling of the single-layer Bi-directional Gated Recurrent Unit (Bi-GRU) in the two-stage network, we adopt a two-layer Bi-GRU structure and introduce frequency-mask and time-mask data augmentation to improve the model's modeling of temporal features and its generalization. In experiments on the TAU Spatial Sound Events 2019 development dataset, our approach significantly improves SELD performance over the two-stage network baseline model. Ablation experiments further confirm the effectiveness of DCAM and the two-layer Bi-GRU structure.
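The frequency-mask and time-mask augmentation mentioned in the abstract can be illustrated with a minimal, standard-library sketch. The abstract does not state mask widths, the number of masks, or the fill value, so the function name `apply_masks`, the parameters `max_freq_width` and `max_time_width`, and the zero fill are all illustrative assumptions rather than the authors' settings.

```python
import random


def apply_masks(spec, max_freq_width=8, max_time_width=10, fill=0.0, rng=None):
    """Return a copy of `spec` with one frequency band and one time span masked.

    `spec` is a list of time frames, each a list of frequency-bin values.
    The widths and fill value are hypothetical defaults, not values from the paper.
    """
    rng = rng or random.Random()
    n_frames = len(spec)
    n_bins = len(spec[0])
    out = [list(frame) for frame in spec]  # copy so the input stays untouched

    # Frequency mask: blank a random band of consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for frame in out:
        for f in range(f_start, f_start + f_width):
            frame[f] = fill

    # Time mask: blank a random run of consecutive frames across all bins.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [fill] * n_bins

    return out


# Usage: a dummy 20-frame, 16-bin "spectrogram" of ones.
spec = [[1.0] * 16 for _ in range(20)]
masked = apply_masks(spec, rng=random.Random(0))
```

Masking is typically applied only to training inputs, with a fresh random mask per example, so the model cannot rely on any single frequency band or time span.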

A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization
