Abstract: Audio tagging is an important task of mapping audio samples to their corresponding categories. Recently, endeavours that exploit transformer models in this field have achieved great success. However, the quadratic self-attention cost limits the scaling of audio transformer models and further constrains the development of more universal audio models. In this paper, we attempt to solve this problem by proposing Audio Mamba, a self-attention-free approach that captures long audio spectrogram dependency with state space models. Our experimental results on two audio-tagging datasets demonstrate the parameter efficiency of Audio Mamba: it achieves comparable results to SOTA audio spectrogram transformers with one third of the parameters.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is that, in existing audio tagging systems, the Transformer's self-attention mechanism has high computational complexity (O(n²)), which limits the scalability and efficiency of the model when processing long audio sequences. Specifically:

1. **Problem Background**:
   - Audio tagging is an important task of mapping audio samples to their corresponding categories.
   - Recently, significant success has been achieved in this field by using the Transformer model.
   - However, the quadratic self-attention cost in the Transformer model limits the scaling of audio Transformer models and further restricts the development of more general-purpose audio models.
2. **Proposed Method**:
   - To solve this problem, the paper proposes a model named Audio Mamba, which adopts a self-attention-free method and captures long audio spectrogram dependencies through state space models (SSMs).
   - The Audio Mamba model aims to process audio data with linear time complexity, thereby improving parameter efficiency and model performance.
3. **Main Contributions**:
   - The Audio Mamba architecture is proposed, which is the first attempt to apply the Mamba architecture to the audio tagging task.
   - By combining the advantages of HTS-AT and VMamba, Audio Mamba can capture and process audio features at multiple scales.
   - The experimental results on two audio tagging datasets show that Audio Mamba can still achieve performance comparable to the existing best models with a reduced number of parameters.
4. **Innovations**:
   - Using state space models instead of the traditional self-attention mechanism reduces the computational complexity.
   - The introduction of a multi-stage architecture and a specific patch-embedding extraction method enhances the flexibility and scalability of the model.

Through these improvements, the Audio Mamba model not only improves parameter efficiency but also shows strong performance potential in the audio tagging task.
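To make the complexity argument concrete, here is a minimal sketch of a discretized linear state space recurrence in NumPy. It is not the Audio Mamba implementation: the function names (`ssm_scan`, `attention_pair_count`, `ssm_step_count`) and the matrices `A_bar`, `B_bar`, and `C` are illustrative assumptions, and the sketch uses fixed (time-invariant) parameters, whereas Mamba-style models make them depend on the input. It only shows why a recurrent scan touches each of the n sequence elements once, while self-attention scores all n² pairs.

```python
import numpy as np


def ssm_scan(u, A_bar, B_bar, C):
    """Run a discretized linear state space model over a sequence.

    Assumed shapes (illustrative, not taken from the paper):
      u:     (n, d_in)            input sequence, e.g. spectrogram patch embeddings
      A_bar: (d_state, d_state)   discrete state transition matrix
      B_bar: (d_state, d_in)      discrete input projection
      C:     (d_out, d_state)     output projection
    Returns y with shape (n, d_out).
    """
    n = u.shape[0]
    h = np.zeros(A_bar.shape[0])       # hidden state, starts at zero
    y = np.empty((n, C.shape[0]))
    for t in range(n):                 # one update per element: O(n) steps
        h = A_bar @ h + B_bar @ u[t]   # h_t = A_bar h_{t-1} + B_bar u_t
        y[t] = C @ h                   # y_t = C h_t
    return y


def attention_pair_count(n):
    """Number of query-key scores full self-attention computes for n tokens."""
    return n * n


def ssm_step_count(n):
    """Number of recurrent state updates the scan above performs for n tokens."""
    return n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_in, d_state, d_out = 1024, 8, 16, 8
    u = rng.standard_normal((n, d_in))
    A_bar = 0.9 * np.eye(d_state)      # a stable toy transition, chosen arbitrarily
    B_bar = rng.standard_normal((d_state, d_in))
    C = rng.standard_normal((d_out, d_state))
    print(ssm_scan(u, A_bar, B_bar, C).shape)
    print(attention_pair_count(n), ssm_step_count(n))
```

Under these assumptions, doubling the number of spectrogram patches doubles the number of scan steps but quadruples the number of attention scores, which is the scaling gap the summary refers to.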

Audio Mamba: Pretrained Audio State Space Model For Audio Tagging
