SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, Jianqin Yin
2024-08-09
Abstract: In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task 3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two technical challenges in the Sound Event Localization and Detection (SELD) task:

1. **Computational Efficiency Issue**: Existing Transformer-based models perform well on SELD, but the cost of self-attention grows quadratically with sequence length, which makes them computationally inefficient.
2. **Source Distance Estimation Issue**: The 2024 DCASE Challenge Task 3 adds a requirement to estimate the distance of each detected sound event, which makes the SELD task more complex.

To address these issues, the authors propose SELD-Mamba, a network built on Mamba, a selective state-space model (SSM). Starting from the Event-Independent Network V2 (EINV2) framework, the model replaces the Conformer blocks with bidirectional Mamba (BMamba) blocks, which capture a broader range of contextual information while keeping computational complexity linear in sequence length. The authors also propose a two-stage training method: the first stage trains on the Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses only, and the second stage reintroduces the Source Distance Estimation (SDE) loss to balance performance across the tasks.

Experimental results show that SELD-Mamba outperforms the baseline and other advanced models on multiple evaluation metrics while using fewer parameters and less computation. These results support the effectiveness and efficiency of the selective state-space model for SELD.
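
The two ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the parameterization (`A`, `W_delta`, `W_B`, `W_C`), the summation of the forward and backward passes, and the loss weight `w_sde` are all assumptions made for illustration. The sketch shows (a) a simplified selective state-space recurrence that visits each time step once, so its cost grows linearly with sequence length, run in both time directions, and (b) a stage-dependent loss that leaves out the distance term in stage 1.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return np.logaddexp(0.0, z)


def init_params(rng, D, N):
    # Hypothetical parameter layout for one scan direction.
    return {
        "A": -np.exp(rng.normal(size=(D, N))),   # negative entries give a decaying state
        "W_delta": rng.normal(size=(D, D)) * 0.1,
        "W_B": rng.normal(size=(D, N)) * 0.1,
        "W_C": rng.normal(size=(D, N)) * 0.1,
    }


def selective_scan(x, p):
    """Simplified selective SSM: x has shape (T, D); returns y of shape (T, D).

    The step size, input matrix, and output matrix all depend on the
    current frame, which is what makes the recurrence "selective".
    One loop over T frames, so the cost is linear in sequence length.
    """
    T, D = x.shape
    N = p["A"].shape[1]
    h = np.zeros((D, N))
    y = np.empty((T, D))
    for t in range(T):
        delta = softplus(x[t] @ p["W_delta"])      # (D,) input-dependent step size
        B_t = x[t] @ p["W_B"]                      # (N,) input-dependent input matrix
        C_t = x[t] @ p["W_C"]                      # (N,) input-dependent output matrix
        A_bar = np.exp(delta[:, None] * p["A"])    # (D, N) discretized decay
        B_bar = delta[:, None] * B_t[None, :]      # (D, N) discretized input gain
        h = A_bar * h + B_bar * x[t][:, None]      # state update
        y[t] = h @ C_t                             # readout
    return y


def bidirectional_scan(x, fwd, bwd):
    # Forward pass sees the past; the pass over the time-reversed input
    # sees the future. Summing the two is an assumption of this sketch.
    y_f = selective_scan(x, fwd)
    y_b = selective_scan(x[::-1], bwd)[::-1]
    return y_f + y_b


def seld_loss(sed_loss, doa_loss, sde_loss, stage, w_sde=1.0):
    # Stage 1: SED and DoA losses only. Stage 2: add the SDE loss back.
    # The unit weights and w_sde are placeholders, not values from the paper.
    if stage == 1:
        return sed_loss + doa_loss
    if stage == 2:
        return sed_loss + doa_loss + w_sde * sde_loss
    raise ValueError("stage must be 1 or 2")
```

In this sketch, each output frame of `bidirectional_scan` depends on both earlier and later frames, whereas a single `selective_scan` is causal. A training loop under these assumptions would call `seld_loss(..., stage=1)` until the detection and direction estimates stabilize, then switch to `stage=2`.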