Abstract: Polyphonic sound event detection (SED) is the task of detecting the time stamps and classes of the sound events that occur in a recording. Real-life sound events overlap in recordings, and their durations vary dramatically, which makes them even harder to recognize. In this paper, we use convolutional recurrent neural networks (CRNNs) to extract hidden-state feature representations, and then introduce a self-attention mechanism with a symmetric score function to capture long-range dependencies among the features that the CRNNs extract. Furthermore, we propose memory-controlled self-attention to explicitly compute the relations between time steps in the audio representation embedding. We then propose a strategy for adaptive memory-controlled self-attention. Moreover, we apply semi-supervised learning, namely the mean teacher–student method, to exploit unlabeled audio data. The proposed methods all performed well in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Sound Event Detection in Real Life Audio (task3) test and the DCASE 2021 Sound Event Detection and Separation in Domestic Environments (task4) test. In DCASE 2017 task3, our model surpassed the F1-score of the challenge's winning system by 6.8%. We show that the proposed adaptive memory-controlled model reaches the same performance level as a model with a fixed attention width. Experimental results indicate that the proposed attention mechanism improves sound event detection. In DCASE 2021 task4, we investigated various pooling strategies in two scenarios. In addition, we found that, in weakly labeled semi-supervised sound event detection, building an attention layer on top of the CRNN is redundant. This conclusion may also apply to other multi-instance learning problems.
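
As a rough illustration of the idea of memory-controlled self-attention with a fixed attention width, the following minimal NumPy sketch restricts each time step to attend only to neighbors within a window. It is not the implementation evaluated in the paper. The function name, the shared projection `w_proj` (used here as one simple way to obtain a symmetric score matrix), the scaling by the square root of the projection size, and the `width` parameter are all assumptions made for illustration.

```python
import numpy as np

def memory_controlled_self_attention(h, w_proj, width):
    """Windowed self-attention over a sequence of hidden states (illustrative sketch).

    h      : (T, D) array of hidden states, e.g. CRNN outputs (assumed shape)
    w_proj : (D, K) projection shared by queries and keys (assumption), so that
             score[i, j] == score[j, i]
    width  : each time step attends only to steps at most `width` frames away
    """
    z = h @ w_proj                                   # (T, K) projected states
    scores = (z @ z.T) / np.sqrt(z.shape[1])         # (T, T) symmetric scores

    # Memory control: mask out pairs of time steps farther apart than `width`.
    t = np.arange(h.shape[0])
    inside = np.abs(t[:, None] - t[None, :]) <= width
    scores = np.where(inside, scores, -np.inf)

    # Row-wise softmax; the diagonal is always inside the window,
    # so every row has at least one finite score.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ h, weights                      # attended states, attention map

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 4))      # 6 time steps, 4 features (toy sizes)
w_proj = rng.standard_normal((4, 3))
attended, weights = memory_controlled_self_attention(h, w_proj, width=1)
```

With `width=1`, each row of `weights` is nonzero only on the frame itself and its immediate neighbors. An adaptive variant would choose the width from the data rather than fixing it in advance; how that choice is made is not specified in this abstract and is not sketched here.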

Multi-Scale Time-Frequency Attention for Acoustic Event Detection

A Joint Detection-Classification Model for Weakly Supervised Sound Event Detection Using Multi-Scale Attention Method

Adaptive Multi-scale Detection of Acoustic Events

Multi-Scale Convolution Based Attention Network for Semi-Supervised Sound Event Detection Technical Report

Event-related data conditioning for acoustic event classification

Multi-scale temporal-frequency attention for music source separation

Divided spectro-temporal attention for sound event localization and detection in real scenes for DCASE2023 challenge

Multi-scale Harmonic Mean Time Surfaces for Event-based Object Classification

Auditory Attention Detection via Cross-Modal Attention

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

Joint framework with deep feature distillation and adaptive focal loss for weakly supervised audio tagging and acoustic event detection

Research on Acoustic Events Recognition Method with Dimensionality Reduction Combining Attention and Mutual Information

TF(2)AN: A Temporal-Frequency Fusion Attention Network for Spectrum Energy Level Prediction

A Multi-grained based Attention Network for Semi-supervised Sound Event Detection

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection

MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Multi-Scale Progressive Fusion Attention Network Based on Small Sample Training for DAS Noise Suppression

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

Time-Frequency Attention for Monaural Speech Enhancement

Multi-scale Convolutional Recurrent Neural Network and Data Augmentation for Polyphonic Sound Event Detection

Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification