Audio Mamba: Pretrained Audio State Space Model For Audio Tagging

Jiaju Lin, Haoxuan Hu
2024-05-22
Abstract: Audio tagging is an important task of mapping audio samples to their corresponding categories. Recently, endeavours that exploit transformer models in this field have achieved great success. However, the quadratic self-attention cost limits the scaling of audio transformer models and further constrains the development of more universal audio models. In this paper, we attempt to solve this problem by proposing Audio Mamba, a self-attention-free approach that captures long audio spectrogram dependencies with state space models. Our experimental results on two audio-tagging datasets demonstrate the parameter efficiency of Audio Mamba: it achieves results comparable to SOTA audio spectrogram transformers with one third of the parameters.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is that the self-attention mechanism in Transformer-based audio tagging models has quadratic computational complexity, O(n²), in the sequence length n, which limits the scalability and efficiency of these models on long audio sequences. Specifically:

1. **Problem Background**:
   - Audio tagging is an important task of mapping audio samples to their corresponding categories.
   - Transformer models have recently achieved significant success in this field.
   - However, the quadratic self-attention cost limits the scaling of audio Transformer models and further restricts the development of more general-purpose audio models.
2. **Proposed Method**:
   - To solve this problem, the paper proposes a model named Audio Mamba, which is self-attention-free and captures long-range dependencies in audio spectrograms through state space models (SSMs).
   - The Audio Mamba model aims to process audio data with linear time complexity, thereby improving parameter efficiency and model performance.
3. **Main Contributions**:
   - The Audio Mamba architecture is proposed, which is the first attempt to apply the Mamba architecture to the audio tagging task.
   - By combining the advantages of HT-SAT and VMamba, Audio Mamba can capture and process audio features at multiple scales.
   - Experimental results on two audio tagging datasets show that Audio Mamba achieves performance comparable to the existing best models with a reduced number of parameters.
4. **Innovations**:
   - Using state space models instead of the traditional self-attention mechanism reduces the computational complexity.
   - The introduction of a multi-stage architecture and a specific patch-embedding extraction method enhances the flexibility and scalability of the model.
Through these improvements, the Audio Mamba model not only improves parameter efficiency but also shows strong performance potential in the audio tagging task.
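
The contrast between quadratic self-attention and a linear-time state space scan can be illustrated with a minimal sketch. This is not the Audio Mamba implementation: it runs a plain diagonal linear recurrence over a 1D sequence and omits Mamba's input-dependent (selective) parameters and any scanning over 2D spectrogram patches. The function names, parameter values, and the rough operation counts are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a diagonal linear state space recurrence,
# not the model described in the paper. All names and values are hypothetical.


def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t) over a sequence.

    xs      : list of floats, the input sequence (e.g. one channel of patch tokens)
    a, b, c : lists of floats of equal length (diagonal state parameters)
    Each step touches the state once, so the cost grows linearly with len(xs).
    """
    state = [0.0] * len(a)
    ys = []
    for x in xs:
        # Update every state dimension from its previous value and the new input.
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        # Read out a scalar output from the current state.
        ys.append(sum(ci * hi for ci, hi in zip(c, state)))
    return ys


def scan_cost(n, state_dim):
    """Rough count of state updates for a length-n scan: linear in n."""
    return n * state_dim


def attention_cost(n, dim):
    """Rough count of multiply-adds for the n x n score matrix: quadratic in n."""
    return n * n * dim


if __name__ == "__main__":
    # Impulse input: the output decays geometrically with factor a = 0.5.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    # Compare how the two rough cost estimates grow with sequence length.
    for n in (256, 1024, 4096):
        print(n, scan_cost(n, state_dim=16), attention_cost(n, dim=16))
```

Quadrupling the sequence length quadruples the scan estimate but multiplies the attention estimate by sixteen, which is the scaling gap the summary refers to. Real costs also depend on hidden sizes, hardware, and implementation details that this sketch ignores.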