Abstract: Polyphonic sound event detection (SED) is the task of detecting the time stamps and classes of the sound events that occur in a recording. Real-life sound events overlap in recordings, and their durations vary dramatically, which makes them even harder to recognize. In this paper, we propose using Convolutional Recurrent Neural Networks (CRNNs) to extract hidden state feature representations; a self-attention mechanism with a symmetric score function is then introduced to memorize long-range dependencies among the features that the CRNNs extract. Furthermore, we propose memory-controlled self-attention to explicitly compute the relations between time steps in the audio representation embedding. We then propose a strategy for adaptive memory-controlled self-attention. Moreover, we apply semi-supervised learning, namely the mean teacher–student method, to exploit unlabeled audio data. The proposed methods all performed well on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Sound Event Detection in Real Life Audio (task3) test and the DCASE 2021 Sound Event Detection and Separation in Domestic Environments (task4) test. In DCASE 2017 task3, our model surpassed the F1-score of the challenge’s winning system by 6.8%. We show that the proposed adaptive memory-controlled model reaches the same performance level as a fixed attention width model. Experimental results indicate that the proposed attention mechanism improves sound event detection. In DCASE 2021 task4, we investigated various pooling strategies in two scenarios. In addition, we found that in weakly labeled semi-supervised sound event detection, building an attention layer on top of the CRNN is redundant. This conclusion could be applied to other multi-instance learning problems.
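
As a rough illustration of the idea of restricting self-attention to a limited memory span, the sketch below computes attention for each time step only over neighbours within a fixed width. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the `width` parameter, and the choice of a scaled dot product of the raw features as the symmetric score function are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def memory_controlled_self_attention(frames, width):
    """Attend each time step only to neighbours within `width` steps.

    frames: list of T feature vectors (lists of floats), e.g. CRNN hidden states.
    width:  hypothetical attention width (memory span) in time steps.
    Returns (outputs, weights); weights[t] maps each attended step j to its weight.
    """
    T = len(frames)
    d = len(frames[0])
    outputs, weights = [], []
    for t in range(T):
        lo, hi = max(0, t - width), min(T, t + width + 1)
        # Assumed symmetric score: scaled dot product of the features themselves,
        # so that score(t, j) == score(j, t).
        scores = [
            sum(a * b for a, b in zip(frames[t], frames[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window only.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        weights.append(dict(zip(range(lo, hi), w)))
        # Output is the attention-weighted sum of the frames inside the window.
        outputs.append(
            [sum(w[k] * frames[lo + k][i] for k in range(hi - lo)) for i in range(d)]
        )
    return outputs, weights

# Toy example: four 2-dimensional frames, each attending to +/- 1 step.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weights = memory_controlled_self_attention(frames, width=1)
```

With `width=0` each step attends only to itself and the output equals the input; larger widths let each step draw on more distant context. An adaptive variant would choose the width from the data rather than fixing it, which this sketch does not attempt.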

Enhancing Audio Retrieval with Attention-based Encoder for Audio Feature Representation

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Graph Attention for Automated Audio Captioning

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Weakly Labelled AudioSet Tagging With Attention Neural Networks

Attention-Based Audio Embeddings for Query-by-Example

Attention-Guided Neural Networks for Full-Reference and No-Reference Audio-Visual Quality Assessment

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Bridging Language Gaps in Audio-Text Retrieval

Enhance audio-visual segmentation with hierarchical encoder and audio guidance

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

Speech enhancement with weakly labelled data from AudioSet

Exploring the Power of Pure Attention Mechanisms in Blind Room Parameter Estimation

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

Acoustic scene classification by feed forward neural network with class dependent attention mechanism

High-Resolution Attention Network with Acoustic Segment Model for Acoustic Scene Classification