MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin
2024-08-19
Abstract: Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How can a Sound Event Detection (SED) model make full use of large-scale unlabeled data to overcome the scarcity of labeled data and improve performance?** Existing SED methods use large pre-trained Transformer encoder networks, but they still rely on RNNs (Recurrent Neural Networks) to model temporal dependencies, mainly because labeled data is scarce. To address this, the authors propose MAT-SED (Masked Audio Transformer for Sound Event Detection), a pure Transformer SED model. It improves SED performance in three ways:

1. **Pure Transformer structure**: MAT-SED uses a Transformer as the context network instead of the traditional RNN, to better capture contextual dependencies over long time ranges.
2. **Self-supervised pre-training**: The context network is pre-trained in a self-supervised way on a masked-reconstruction task, which lets the model exploit large amounts of unlabeled data.
3. **Global-local feature fusion strategy**: A global-local feature fusion strategy is introduced in the fine-tuning stage to enhance the model's localization ability.

With these improvements, MAT-SED outperforms existing state-of-the-art SED systems on DCASE2023 Task 4, achieving scores of 0.587 and 0.896 on the PSDS1 and PSDS2 metrics respectively.

### Formula summary

- **Reconstruction loss function**:
  \[ L_m = \sum_{x \in D} \sum_{t \in M_x} \left( \hat{z}_t(x) - z_t(x) \right)^2 \]
  where \(D\) is the pre-training dataset, \(M_x\) is the set of masked frame indices in sample \(x\), and \(\hat{z}_t(x)\) and \(z_t(x)\) are the reconstructed and original features at frame \(t\).
- **Global-local feature fusion**:
  \[ Z_{\text{fused}} = \lambda Z_{\text{local}} + (1 - \lambda) Z_{\text{global}} \]
  where \(\lambda\) is a hyperparameter that weights the local features against the global features; it is set to 0.5 in the experiments.
Together, these methods enable MAT-SED to achieve strong performance even when labeled data is limited.
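
The two formulas in the summary above can be illustrated with a minimal pure-Python sketch. The function names, the nested-list representation of frames, and the choice to sum squared errors over the feature dimensions of each frame are assumptions made for illustration; they are not taken from the authors' implementation. How \(Z_{\text{local}}\) and \(Z_{\text{global}}\) are produced is not covered here, so the sketch takes them as given inputs.

```python
from typing import List, Sequence, Set

# One feature vector per time frame (hypothetical representation).
Frames = List[List[float]]


def masked_reconstruction_loss(
    predicted: Sequence[Frames],
    target: Sequence[Frames],
    masks: Sequence[Set[int]],
) -> float:
    """Sum of squared errors over masked frames only (L_m in the summary).

    predicted[i], target[i] and masks[i] belong to the i-th sample x in D;
    masks[i] plays the role of M_x, the set of masked frame indices.
    """
    total = 0.0
    for z_hat, z, masked in zip(predicted, target, masks):
        for t in masked:
            # Assumption: the squared error is summed over feature dimensions.
            total += sum((a - b) ** 2 for a, b in zip(z_hat[t], z[t]))
    return total


def fuse_global_local(z_local: Frames, z_global: Frames, lam: float = 0.5) -> Frames:
    """Elementwise Z_fused = lam * Z_local + (1 - lam) * Z_global."""
    return [
        [lam * loc + (1.0 - lam) * glo for loc, glo in zip(f_loc, f_glo)]
        for f_loc, f_glo in zip(z_local, z_global)
    ]


# Toy example: one sample with three frames of two features each.
z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
z_hat = [[9.0, 9.0], [2.0, 4.0], [5.0, 8.0]]

# Frame 0 is not masked, so its large error does not contribute.
loss = masked_reconstruction_loss([z_hat], [z], [{1, 2}])
print(loss)  # 5.0

fused = fuse_global_local([[1.0, 3.0]], [[3.0, 5.0]], lam=0.5)
print(fused)  # [[2.0, 4.0]]
```

Only frames listed in the mask set contribute to the loss, which is why the deliberately wrong prediction at frame 0 has no effect. With \(\lambda = 0.5\), the fusion reduces to a plain average of the local and global features.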