Abstract: This technical report describes the CP-JKU team's submission for Task 4, Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels, of the DCASE 2024 Challenge. We fine-tune three large Audio Spectrogram Transformers (PaSST, BEATs, and ATST) on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both the CRNN and the transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, which boosts single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels before fine-tuning them on the Task 4 datasets.
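The pseudo-labeling and distillation step described in the abstract can be illustrated with a minimal, pure-Python sketch. It makes several assumptions that the abstract does not confirm: the ensemble is taken as a plain average of frame-level class probabilities, the distillation loss is a binary cross-entropy against the soft pseudo-labels, and the names `ensemble_pseudo_labels`, `bce`, `iteration2_loss`, and `distill_weight` are hypothetical.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-level class probabilities over several models.

    model_probs: list (one entry per model) of [frames][classes] grids.
    Plain averaging is an assumption; the abstract only says "ensemble".
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a [frames][classes] grid.

    Targets may be soft (any value in [0, 1]), as pseudo-labels usually are.
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


def iteration2_loss(student, strong_labels, pseudo_labels, distill_weight=1.0):
    """Supervised loss plus a weighted distillation term (assumed form)."""
    return bce(student, strong_labels) + distill_weight * bce(student, pseudo_labels)


if __name__ == "__main__":
    # Two hypothetical models, two frames, two classes.
    probs_a = [[0.9, 0.1], [0.2, 0.8]]
    probs_b = [[0.7, 0.3], [0.4, 0.6]]
    pseudo = ensemble_pseudo_labels([probs_a, probs_b])
    hard = [[1.0, 0.0], [0.0, 1.0]]
    print(iteration2_loss(probs_a, hard, pseudo, distill_weight=0.5))
```

In this sketch the student is trained on the ground-truth labels and the ensemble's soft predictions at the same time. Real systems would compute both terms on tensors and only apply the supervised term to clips that actually carry strong labels.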

What problem does this paper attempt to address?

The paper addresses Task 4 of the 2024 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge: Sound Event Detection (SED) using heterogeneous training datasets and potentially missing labels. The team's goal is a single sound event detection system trained on two datasets, DESED and MAESTRO Real, which differ in their characteristics and label granularity and may have missing labels. To improve performance, the team adopted the following strategies:

1. **Multi-stage training**: In the first stage, a large pre-trained audio spectrogram transformer (PaSST, BEATs, or ATST) is kept frozen while a CRNN, consisting of a Convolutional Neural Network (CNN) and Bidirectional Gated Recurrent Units (BiGRU), is trained. In the second stage, the pre-trained transformer is unfrozen and fine-tuned together with the CRNN.
2. **Multi-iteration training**: After the first iteration, an ensemble of the fine-tuned transformers generates strong pseudo-labels for the training clips. A second iteration repeats the two-stage training with a distillation loss on these pseudo-labels, which further improves individual models.
3. **Selection and adjustment of pre-trained models**: ATST, PaSST, and BEATs are used as pre-trained models, with architectural adjustments that make them better suited to sound event detection.
4. **Additional pre-training data**: PaSST and ATST are additionally pre-trained on the strongly labeled subset of AudioSet before being fine-tuned on the Task 4 datasets.
5. **Data augmentation and sampling strategies**: Several data augmentation techniques improve generalization, and a dedicated sampling strategy balances the contributions of the different datasets.
6. **Post-processing**: The Sound Event Bounding Boxes (SEBB) method is applied to refine the final predictions.

Together, these methods improved the performance of individual models, and ensembling the models raised overall performance further, yielding new best results, especially on the DESED dataset.
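The multi-stage, multi-iteration procedure summarized above can be written down as a small schedule. This is only an illustrative sketch: the names `StageConfig`, `build_schedule`, and `trainable_modules` are hypothetical, and it covers just what the text states (which modules are trained in each stage and when the distillation loss is active). It does not include learning rates, loss weights, or any other hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One training stage (hypothetical container, not the authors' code)."""
    name: str
    train_transformer: bool   # False means the pre-trained transformer is frozen
    train_crnn: bool
    use_distillation: bool    # pseudo-label distillation loss active?


def build_schedule(num_iterations=2):
    """Return the stages in training order.

    Each iteration has two stages: stage 1 trains only the CRNN on top of a
    frozen transformer; stage 2 fine-tunes both. From the second iteration
    on, the pseudo-label distillation loss is added.
    """
    schedule = []
    for iteration in range(1, num_iterations + 1):
        distill = iteration > 1
        schedule.append(StageConfig(f"iter{iteration}-stage1", False, True, distill))
        schedule.append(StageConfig(f"iter{iteration}-stage2", True, True, distill))
    return schedule


def trainable_modules(stage):
    """List the module names whose parameters receive gradient updates."""
    modules = []
    if stage.train_crnn:
        modules.append("crnn")
    if stage.train_transformer:
        modules.append("transformer")
    return modules


if __name__ == "__main__":
    for stage in build_schedule():
        print(stage.name, trainable_modules(stage), "distill:", stage.use_distillation)
```

In a deep-learning framework, a frozen module would typically have gradient computation disabled for its parameters. The schedule above only records that decision per stage.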

Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training
