Abstract: In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the local feature extraction efficiencies of CNNs with the global context modeling capabilities of transformers, resulting in a model that is powerful yet lightweight, well-suited to a real-time or mobile use case. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus towards real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks and in some cases surpasses existing benchmarks while using up to 150x fewer parameters.

What problem does this paper attempt to address?

The paper addresses how to build audio classification models that are both efficient and robust enough for real-time applications. It proposes FAST (Fast Audio Spectrogram Transformer), which combines the strengths of convolutional neural networks (CNNs) and Transformers to cut the parameter count substantially while keeping performance high and improving computational efficiency.

### Problem Background

Audio classification has advanced considerably, largely through deep learning models. CNNs perform well on audio because they extract local features efficiently, but they struggle with long-range dependencies. Transformers capture global context through self-attention, but they are computationally expensive, which is a problem when resources are limited, and their training can be unstable.

### Solution

FAST combines CNNs and Transformers and adds Lipschitz-continuous components to stabilize training. The specific improvements are:

1. **Combining CNNs and Transformers**: FAST uses CNNs for efficient local feature extraction and Transformers for global context modeling. The model therefore captures both local features and global information, which suits real-time and mobile applications.
2. **Lipschitz Continuity Mechanism**:
   - **CenterNorm**: Replaces LayerNorm to keep gradients stable.
   - **Scaled Cosine Similarity Attention (SCSA)**: Uses cosine similarity instead of dot-product attention so that attention scores cannot grow without bound.
   - **Weighted Residual Shortcuts (WRS)**: Introduces weighted residual connections that control the magnitude of each residual update and keep the model stable.
3. **Lightweight Design**: Through its optimized structure, FAST uses up to 150 times fewer parameters than existing models, which improves computational efficiency and inference speed.

### Experimental Verification

The paper evaluates FAST on two datasets:

- **ADIMA Dataset**: A multilingual corpus for real-time profanity and abuse detection.
- **AudioSet Dataset**: A large-scale audio event classification dataset.

The results show that FAST reaches state-of-the-art performance in multiple languages, with particular advantages in parameter count and inference time.

### Summary

By combining CNNs and Transformers and adding Lipschitz-continuous components, FAST addresses the trade-off between high performance and computational efficiency in audio classification, offering a new option for real-time audio processing.
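
The three stability components named under Solution (CenterNorm, SCSA, WRS) can be illustrated with a minimal pure-Python sketch. The summary names these components but gives no formulas, so the details here are assumptions rather than the paper's implementation: CenterNorm is taken to be mean subtraction without variance scaling, SCSA is taken to be a softmax over `tau` times the cosine similarity of L2-normalized queries and keys, and WRS is taken to be `x + alpha * f(x)`. The names `tau`, `alpha`, `eps`, and all function names are hypothetical, and learnable parameters and multi-head projections are omitted.

```python
import math


def center_norm(x):
    """Assumed CenterNorm: subtract the feature mean, with no division by the std."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def l2_normalize(x, eps=1e-6):
    """Scale a vector to (at most) unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in x) + eps)
    return [v / norm for v in x]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_logits(query, keys, tau=10.0):
    """Attention logits as tau * cosine similarity, so each logit lies in [-tau, tau]."""
    q_hat = l2_normalize(query)
    return [
        tau * sum(a * b for a, b in zip(q_hat, l2_normalize(k)))
        for k in keys
    ]


def scaled_cosine_attention(queries, keys, values, tau=10.0):
    """Assumed SCSA: softmax over bounded cosine logits, then a weighted sum of values."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax(cosine_logits(q, keys, tau))
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


def weighted_residual(x, fx, alpha=0.1):
    """Assumed WRS: add the branch output scaled by a small weight alpha."""
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


# Toy usage: three tokens with four features each (values are illustrative only).
tokens = [[1.0, 2.0, 3.0, 4.0], [100.0, -50.0, 25.0, 0.0], [0.5, 0.5, 0.5, 1.5]]
normed = [center_norm(t) for t in tokens]
attended = scaled_cosine_attention(normed, normed, normed, tau=10.0)
updated = [weighted_residual(t, a, alpha=0.1) for t, a in zip(tokens, attended)]
print(updated)
```

Because the logits depend only on the angle between query and key, a token with very large magnitude (the second one above) cannot push the attention scores beyond `tau`, which is the property the SCSA bullet describes. Likewise, a small `alpha` limits how much each block can change its input.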

FAST: Fast Audio Spectrogram Transformer
