Abstract: Polyphonic sound event detection (SED) is the task of detecting the time stamps and classes of the sound events that occur in a recording. Real-life sound events overlap in recordings, and their durations vary dramatically, making them even harder to recognize. In this paper, we propose Convolutional Recurrent Neural Networks (CRNNs) to extract hidden state feature representations; a self-attention mechanism using a symmetric score function is then introduced to memorize long-range dependencies of the features that the CRNNs extract. Furthermore, we propose to use memory-controlled self-attention to explicitly compute the relations between time steps in the audio representation embedding, and we propose a strategy for making the memory-controlled self-attention mechanism adaptive. Moreover, we apply semi-supervised learning, namely, mean teacher–student methods, to exploit unlabeled audio data. The proposed methods all performed well in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Sound Event Detection in Real Life Audio (task3) test and the DCASE 2021 Sound Event Detection and Separation in Domestic Environments (task4) test. In DCASE 2017 task3, our model surpassed the F1-score of the challenge's winning system by 6.8%. We show that the proposed adaptive memory-controlled model reaches the same performance level as a fixed attention width model. Experimental results indicate that the proposed attention mechanism improves sound event detection. In DCASE 2021 task4, we investigated various pooling strategies in two scenarios. In addition, we found that in weakly labeled semi-supervised sound event detection, building an attention layer on top of the CRNN is redundant. This conclusion could be applied to other multi-instance learning problems.
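The idea of memory-controlled self-attention can be illustrated with a minimal sketch. The abstract does not give the exact score function or windowing rule, so the following is an assumption rather than the authors' implementation: the score is a scaled dot product of each frame with its neighbours (symmetric because no separate query and key projections are used), and the hypothetical parameter `width` limits how many time steps on either side a frame may attend to. The adaptive choice of width is not sketched here.

```python
import math


def memory_controlled_self_attention(frames, width):
    """Windowed self-attention over a sequence of feature vectors.

    frames: list of equal-length lists of floats, one per time step
            (standing in for CRNN hidden states; an assumption).
    width:  hypothetical attention width; frame i only attends to
            frames j with |i - j| <= width.
    """
    dim = len(frames[0])
    outputs = []
    for i, query in enumerate(frames):
        # Restrict attention to a local window around time step i.
        lo = max(0, i - width)
        hi = min(len(frames), i + width + 1)
        window = range(lo, hi)

        # Symmetric score: score(i, j) == score(j, i).
        scores = [
            sum(q * k for q, k in zip(query, frames[j])) / math.sqrt(dim)
            for j in window
        ]

        # Numerically stable softmax over the window only.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]

        # Weighted sum of the frames inside the window.
        context = [
            sum(w * frames[j][d] for w, j in zip(weights, window))
            for d in range(dim)
        ]
        outputs.append(context)
    return outputs
```

With `width=0` each frame attends only to itself and the input is returned unchanged; larger widths let each time step draw on more of its neighbours, which is the quantity a fixed-width or adaptive scheme would control.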

Convolutional bidirectional long short-term memory hidden Markov model hybrid system for polyphonic sound event detection

Duration-Controlled LSTM for Polyphonic Sound Event Detection

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

Multi-scale Convolutional Recurrent Neural Network and Data Augmentation for Polyphonic Sound Event Detection

Polyphonic audio event detection: multi-label or multi-class multi-task classification problem?

MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection

A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN

Relational Recurrent Neural Networks for Polyphonic Sound Event Detection

Joint Analysis of Acoustic Events and Scenes Based on Multitask Learning

A Sequence Matching Network for Polyphonic Sound Event Localization and Detection

Multi-Scale Recurrent Neural Network for Sound Event Detection

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

Hierarchical-Concatenate Fusion TDNN for sound event classification

Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

A Capsule based Approach for Polyphonic Sound Event Detection

Automatic Speech Recognition: A Study and Performance Evaluation on Neural Networks and Hidden Markov Models

Application of Hidden Markov Chain and Artificial Neural Networks in Music Recognition and Classification

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy