Abstract: Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can the Sound Event Detection (SED) task make full use of large-scale unlabeled data and overcome the scarcity of labeled data to improve model performance?** Specifically, although existing SED methods use large pre-trained Transformer encoder networks, they still rely on RNNs (Recurrent Neural Networks) to model temporal dependencies, mainly because labeled data is scarce. To address this, the authors propose a pure Transformer-based SED model, MAT-SED (Masked Audio Transformer for Sound Event Detection). The model improves SED performance in three ways:

1. **Pure Transformer structure**: MAT-SED uses a Transformer as the context network instead of the traditional RNN, to better capture contextual dependencies over long time ranges.
2. **Self-supervised pre-training**: The context network is pre-trained with the masked-reconstruction task, making the most of the large amount of unlabeled data.
3. **Global-local feature fusion strategy**: A global-local feature fusion strategy is introduced in the fine-tuning stage to enhance the model's localization ability.

With these improvements, MAT-SED outperforms existing state-of-the-art SED systems on DCASE2023 Task 4, achieving scores of 0.587 and 0.896 on the PSDS1 and PSDS2 metrics respectively.

### Formula summary

- **Reconstruction loss function**: \[ L_m=\sum_{x\in D}\sum_{t\in M_x}(\hat{z}_t(x)-z_t(x))^2 \] where \(D\) represents the pre-training data set, and \(M_x\) represents the set of masked frame indices in sample \(x\).
- **Global-local feature fusion**: \[ Z_{\text{fused}}=\lambda Z_{\text{local}}+(1 - \lambda)Z_{\text{global}} \] where \(\lambda\) is a hyperparameter that controls the balance between local and global features in the fusion, and is set to 0.5 in the experiments.

These methods enable MAT-SED to achieve excellent performance even when labeled data is limited.
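
The two formulas in the summary can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names `masked_reconstruction_loss` and `fuse_global_local` and the `Frames` alias are hypothetical, and the sketch assumes that \(\hat{z}_t(x)\) and \(z_t(x)\) are the reconstructed and target feature vectors at frame \(t\), with the squared difference summed over feature dimensions. The loss function covers one sample; summing it over all samples in \(D\) would give \(L_m\).

```python
from typing import List, Sequence

# Assumption: a sample is a sequence of frames, each frame a feature vector.
Frames = Sequence[Sequence[float]]


def masked_reconstruction_loss(
    predicted: Frames, target: Frames, masked_frames: Sequence[int]
) -> float:
    """Squared error summed over the masked frames of one sample.

    Corresponds to the inner sum over t in M_x; unmasked frames are ignored.
    """
    loss = 0.0
    for t in masked_frames:
        for p, z in zip(predicted[t], target[t]):
            loss += (p - z) ** 2
    return loss


def fuse_global_local(
    z_local: Frames, z_global: Frames, lam: float = 0.5
) -> List[List[float]]:
    """Element-wise Z_fused = lam * Z_local + (1 - lam) * Z_global."""
    return [
        [lam * loc + (1.0 - lam) * glo for loc, glo in zip(row_l, row_g)]
        for row_l, row_g in zip(z_local, z_global)
    ]


# Toy data: 3 frames with 2 feature dimensions; only frame 1 is masked.
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
predicted = [[0.0, 0.0], [2.0, 6.0], [5.0, 6.0]]

# Frame 0 also differs from the target, but it is not masked,
# so only frame 1 contributes: (2 - 3)^2 + (6 - 4)^2 = 5.
loss = masked_reconstruction_loss(predicted, target, masked_frames=[1])

# With lam = 0.5 the fusion is a plain average of the two feature maps.
fused = fuse_global_local([[2.0, 4.0]], [[0.0, 2.0]])
```

In this toy run `loss` is 5.0 and `fused` is `[[1.0, 3.0]]`. Restricting the loss to masked frames is what makes the task a reconstruction problem rather than an identity mapping: the network is only scored on frames it could not see.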

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
