FAST: Fast Audio Spectrogram Transformer

Anugunj Naman, Gaibo Zhang
2025-01-02
Abstract: In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the local feature extraction efficiencies of CNNs with the global context modeling capabilities of transformers, resulting in a model that is powerful yet lightweight, well-suited to a real-time or mobile use case. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus towards real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks and in some cases surpasses existing benchmarks while using up to 150x fewer parameters.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to build audio classification models that are both efficient and robust enough for real-time applications. It proposes FAST (Fast Audio Spectrogram Transformer), which combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to cut the parameter count substantially while maintaining high accuracy and improving computational efficiency.

### Problem Background

Audio classification has advanced considerably, largely through deep learning models. CNNs perform well because they extract local features efficiently, but they struggle to model long-range dependencies. Transformers capture global context through self-attention, but their computational cost is high and their training can be unstable, which is a problem when computational resources are limited.

### Solution

To address these challenges, the FAST architecture combines CNNs and Transformers and introduces Lipschitz continuity mechanisms to stabilize training. The specific improvements are:

1. **Combining CNNs and Transformers**: FAST uses CNNs for efficient local feature extraction and Transformers for global context modeling. The model therefore captures both local features and global information, which suits real-time and mobile applications.
2. **Lipschitz Continuity Mechanisms**:
   - **CenterNorm**: Replaces LayerNorm to keep gradients stable.
   - **Scaled Cosine Similarity Attention (SCSA)**: Uses cosine similarity instead of dot-product attention so that attention scores cannot grow without bound.
   - **Weighted Residual Shortcuts (WRS)**: Introduces weighted residual connections that limit the magnitude of each residual update and keep the model stable.
3. **Lightweight Design**: With its optimized structure, FAST uses up to 150 times fewer parameters than existing models, improving computational efficiency and inference speed.

### Experimental Verification

The paper evaluates FAST on two datasets:

- **ADIMA Dataset**: A multilingual corpus for real-time profanity and abuse detection.
- **AudioSet Dataset**: A large-scale audio event classification dataset.

The experimental results show that FAST reaches state-of-the-art performance across multiple languages, with particularly large advantages in parameter count and inference time.

### Summary

By combining CNNs and Transformers and introducing Lipschitz continuity mechanisms, FAST addresses the tension between high performance and computational efficiency in audio classification and offers a new option for real-time audio processing.
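
The three Lipschitz-oriented components listed under Solution (CenterNorm, SCSA, and WRS) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The summary gives no formulas, so the exact forms below are assumptions: the `d / (d - 1)` factor in `center_norm`, the `score_scale` parameter, and the scalar `alpha` are hypothetical choices made for illustration only.

```python
import numpy as np


def center_norm(x, gamma=None, beta=None):
    """Center features over the last axis without dividing by the std.

    Assumption: skipping the division by the standard deviation is what
    distinguishes this from LayerNorm; the d / (d - 1) factor is an
    illustrative choice, not taken from the summary.
    """
    d = x.shape[-1]
    out = (d / (d - 1)) * (x - x.mean(axis=-1, keepdims=True))
    if gamma is not None:
        out = out * gamma
    if beta is not None:
        out = out + beta
    return out


def l2_normalize(x, eps=1e-6):
    """Scale each row to (almost) unit length; eps avoids division by zero."""
    return x / np.sqrt((x * x).sum(axis=-1, keepdims=True) + eps)


def scaled_cosine_attention(q, k, v, score_scale=1.0, eps=1e-6):
    """Single-head attention whose scores are scaled cosine similarities.

    Every score lies in [-score_scale, score_scale], so the scores cannot
    grow without bound no matter how large q and k become.
    """
    scores = score_scale * (l2_normalize(q, eps) @ l2_normalize(k, eps).T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def weighted_residual(x, branch_out, alpha):
    """Residual shortcut with a weight alpha on the branch output.

    Assumption: alpha is a small scalar (or per-channel vector) that would
    be learned during training; here it is passed in directly.
    """
    return x + alpha * branch_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))  # 4 spectrogram patches, 8 channels
    normed = center_norm(tokens)
    attended, attn = scaled_cosine_attention(normed, normed, normed, score_scale=5.0)
    updated = weighted_residual(tokens, attended, alpha=0.1)
    print(updated.shape, attn.sum(axis=-1))
```

Because the queries and keys are normalized before the dot product, multiplying them by a large constant leaves the attention weights essentially unchanged, which is the boundedness property the SCSA bullet describes.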