SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, Jianqin Yin

2024-08-09

Abstract: In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task 3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.
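The two-stage training schedule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `seld_loss` and `stage_for_epoch`, the unit loss weights, and the idea of switching stages at a fixed epoch count are all assumptions made for illustration. The only behavior taken from the abstract is that stage 1 optimizes the SED and DoA losses and stage 2 brings the SDE loss back in.

```python
def seld_loss(sed_loss, doa_loss, sde_loss, stage, weights=(1.0, 1.0, 1.0)):
    """Combine per-task losses for a given training stage.

    Stage 1 uses only the SED and DoA terms; stage 2 also includes the
    SDE (distance) term. The weights are hypothetical placeholders.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    w_sed, w_doa, w_sde = weights
    total = w_sed * sed_loss + w_doa * doa_loss
    if stage == 2:
        # Second stage: the distance-estimation loss is added back.
        total += w_sde * sde_loss
    return total


def stage_for_epoch(epoch, stage1_epochs):
    """Pick the stage from a zero-based epoch index.

    Switching after a fixed number of epochs is an assumption; the
    abstract does not say how the stage boundary is chosen.
    """
    return 1 if epoch < stage1_epochs else 2
```

In a training loop, `stage_for_epoch(epoch, stage1_epochs)` would be evaluated once per epoch and its result passed to `seld_loss` for every batch, so the distance term only starts contributing gradients once the second stage begins.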

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two technical challenges in the Sound Event Localization and Detection (SELD) task:

1. **Computational Efficiency Issue**: Existing Transformer-based models perform well on SELD, but the quadratic complexity of their self-attention mechanism makes them computationally inefficient.
2. **Source Distance Estimation Issue**: The 2024 DCASE Challenge Task 3 adds the requirement to estimate the distance of each detected sound event, which makes the SELD task more complex.

To address these issues, the authors propose SELD-Mamba, a network built on Mamba, a selective state-space model (SSM). The model replaces the Conformer blocks of the original EINV2 framework with bidirectional Mamba (BMamba) blocks, which capture broader contextual information while keeping computational complexity linear in sequence length.

The authors also propose a two-stage training method. The first stage focuses on the Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses; the second stage adds the Source Distance Estimation (SDE) loss so that performance is balanced across the three tasks.

Experimental results show that SELD-Mamba outperforms baseline models and other advanced models on multiple evaluation metrics, with fewer parameters and lower computational cost. These results support the effectiveness and efficiency of the selective state-space model for SELD.
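The idea of a bidirectional, input-dependent ("selective") scan with linear cost can be sketched in plain Python. This is a toy stand-in and not the Mamba block or the authors' BMamba block: the scalar state, the sigmoid gate with hypothetical parameters `w` and `b`, and the choice to sum the forward and backward outputs are all assumptions. What it does show is that each direction is a single pass over the sequence, so the cost grows linearly with length, and that the backward pass lets early frames see later context.

```python
import math


def selective_scan(xs, w=1.0, b=0.0):
    """One-directional scan with an input-dependent retain gate.

    h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t = sigmoid(w * x_t + b).
    A single loop over xs, so the cost is linear in len(xs).
    """
    h = 0.0
    out = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-(w * x + b)))  # gate depends on the input
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


def bidirectional_scan(xs, w=1.0, b=0.0):
    """Run the scan in both directions and sum the results per time step.

    Summation is an illustrative choice; how the two directions are
    merged in SELD-Mamba is not stated in the text above.
    """
    fwd = selective_scan(xs, w, b)
    bwd = selective_scan(xs[::-1], w, b)[::-1]  # scan from the end, then realign
    return [f + g for f, g in zip(fwd, bwd)]
```

For an input such as `[0.0, 0.0, 1.0, 0.0]`, the forward scan alone leaves the first two outputs at zero because the event has not happened yet, whereas the bidirectional output at those positions is positive because the backward pass has already seen the later event.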

A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator

Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge

MVANet: Multi-Stage Video Attention Network for Sound Event Localization and Detection with Source Distance Estimation

A Model Ensemble Approach for Sound Event Localization and Detection

CST-former: Transformer with Channel-Spectro-Temporal Attention for Sound Event Localization and Detection

A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection

MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

PSELDNets: Pre-trained Neural Networks on Large-scale Synthetic Datasets for Sound Event Localization and Detection

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning

An Experimental Study on Sound Event Localization and Detection under Realistic Testing Conditions

ICASSP 2022 L3DAS22 Challenge: Ensemble of Resnet-Conformers with Ambisonics Data Augmentation for Sound Event Localization and Detection

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Improving Sound Event Localization and Detection with Class-Dependent Sound Separation for Real-World Scenarios

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection