Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer
2024-07-18
Abstract: This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses Task 4 of the 2024 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge: Sound Event Detection (SED) with heterogeneous training datasets and potentially missing labels. The team's goal is a single sound event detection system trained on two datasets, DESED and MAESTRO Real, which differ in their characteristics and label granularity and may have missing labels. To improve the system's performance, the team adopted the following strategies:
1. **Multi-stage training**: In the first stage, the large pre-trained audio spectrogram transformer (PaSST, BEATs, or ATST) is frozen while a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) are trained on top of it. In the second stage, the transformer is unfrozen and fine-tuned together with them.
2. **Multi-iteration training**: After the two stages, an ensemble of the fine-tuned models generates pseudo-labels for the training clips. A further iteration repeats the two-stage procedure with these pseudo-labels as an additional training target, which improves the performance of individual models.
3. **Selection and adjustment of pre-trained models**: Three pre-trained models (ATST, PaSST, and BEATs) are used, with architectural adjustments that make them better suited to the sound event detection task.
4. **Additional data for pre-training**: The strongly labeled subset of AudioSet is used to pre-train some of the models further before fine-tuning.
5. **Data augmentation and sampling strategies**: Several data augmentation techniques are applied to improve generalization, and a dedicated sampling strategy balances the contributions of the different datasets.
6. **Post-processing**: The Sound Event Bounding Boxes (SEBB) method is applied to the final predictions.
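The two-stage, two-iteration schedule described in the strategies can be sketched as follows. This is a minimal illustration of the freeze and unfreeze logic only, not the team's implementation: `Module`, `SedModel`, `configure_stage`, and `training_plan` are hypothetical names, no actual training happens, and restarting from a fresh model in each iteration is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Stand-in for a network component; it only tracks whether it is trainable."""
    name: str
    trainable: bool = True


@dataclass
class SedModel:
    """Hypothetical container: a pre-trained transformer feeding a CRNN."""
    transformer: Module = field(default_factory=lambda: Module("transformer"))
    crnn: Module = field(default_factory=lambda: Module("crnn"))

    def trainable_parts(self):
        return [m.name for m in (self.transformer, self.crnn) if m.trainable]


def configure_stage(model, stage):
    """Stage 1 freezes the transformer; stage 2 fine-tunes transformer and CRNN."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    model.transformer.trainable = stage == 2
    model.crnn.trainable = True
    return model


def training_plan(num_iterations=2):
    """Return (iteration, stage, trainable parts, uses distillation) per phase."""
    plan = []
    for iteration in range(1, num_iterations + 1):
        model = SedModel()  # assumption: each iteration restarts from a fresh model
        for stage in (1, 2):
            configure_stage(model, stage)
            # Pseudo-labels exist only after the first iteration has finished.
            use_distillation = iteration > 1
            plan.append((iteration, stage, model.trainable_parts(), use_distillation))
    return plan


if __name__ == "__main__":
    for phase in training_plan():
        print(phase)
```

In a real framework the `trainable` flag would correspond to enabling or disabling gradient updates for the transformer's parameters.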
Through the above methods, the team improved the performance of individual models and then raised overall performance further by ensembling them, reaching a new best result, most notably on the DESED dataset.
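The pseudo-label step can be illustrated with a small sketch. The source states only that an ensemble of the fine-tuned models produces the pseudo-labels and that a distillation loss is added in the second iteration; the choices below (averaging the models' frame-level probabilities, using binary cross-entropy for both loss terms, and the `distill_weight` parameter) are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-level class probabilities over several models.

    model_probs: one entry per model, each a list of frames,
    each frame a list of per-class probabilities.
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a frames x classes grid; targets may be soft."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def total_loss(pred, labels, pseudo, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term against the pseudo-labels."""
    return bce(pred, labels) + distill_weight * bce(pred, pseudo)


if __name__ == "__main__":
    # Toy example: three models, two frames, two classes.
    teachers = [
        [[0.9, 0.1], [0.2, 0.7]],
        [[0.8, 0.2], [0.3, 0.9]],
        [[0.7, 0.3], [0.1, 0.8]],
    ]
    pseudo = ensemble_pseudo_labels(teachers)
    student = [[0.6, 0.4], [0.4, 0.6]]
    labels = [[1.0, 0.0], [0.0, 1.0]]
    print(pseudo)
    print(total_loss(student, labels, pseudo, distill_weight=0.5))
```

Because the pseudo-labels are soft probabilities rather than hard 0/1 decisions, the distillation term passes the ensemble's confidence on to the single model it supervises.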