Abstract: Transformer architectures have recently shown outstanding results in deep learning, significantly increasing model accuracy across a variety of domains. Because of their complicated network topology and high inference cost, researchers have started to question whether such a sophisticated structure is actually necessary and whether comparable results can be reached at lower inference cost. To demonstrate the Mixer's efficacy, this paper applies a more condensed version of the Mixer to audio classification and conducts comparative experiments against the Transformer-based Audio Spectrogram Transformer (AST) model on three datasets: Speech Commands, UrbanSound8k, and CASIA Chinese Sentiment Corpus. It also compares several activation functions within the Mixer, namely GeLU, Mish, Swish, and Acon-C. Additionally, some flaws of the AST model are highlighted, and the proposed model is improved accordingly. In conclusion, this study proposes the Audio Spectrogram Mixer, the first Mixer-based model for audio classification, and examines directions for its future improvement.

What problem does this paper attempt to address?

This paper asks whether a simpler network structure, the Mixer (specifically the Audio Spectrogram Mixer, ASM), can match or even exceed the performance of the complex and computationally expensive Transformer structure in audio classification tasks. It explores this question through the following aspects:

1. **Model Structure Simplification**: The paper proposes a new Mixer-based model, the Audio Spectrogram Mixer (ASM), aiming to reduce model complexity and computational cost while maintaining or improving audio classification performance.
2. **Comparative Experiments**: The paper conducts comparative experiments on three public datasets (Speech Commands, UrbanSound8k, and CASIA Chinese Sentiment Corpus), comparing the ASM model with the Transformer-based Audio Spectrogram Transformer (AST) model on each.
3. **Activation Function Selection**: The paper studies how different activation functions (GeLU, Mish, Swish, and Acon-C) affect the performance of the ASM model in order to determine the best activation function configuration.
4. **Influence of Pretrained Models**: The paper explores how the ASM model performs when initialized from pretrained models (such as ImageNet pretrained models) and verifies the effectiveness of using the RGB-to-grayscale conversion formula.
5. **Optimization Methods**: The paper proposes optimization methods, such as adjusting the shape of the Mixer block and using self-supervised training, to address the mismatch that arises when migrating visual models to audio tasks.

Through these studies, the paper aims to show that in audio classification tasks a simpler Mixer structure can both reduce computational cost and achieve performance comparable to or better than the Transformer structure. This could provide new ideas and technical paths for future audio processing tasks.
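To make the Mixer structure and the activation-function comparison described above more concrete, here is a minimal NumPy sketch of a generic MLP-Mixer block with a swappable activation. It is not the authors' implementation: the function names, tensor sizes, weight initialization, and the fixed scalar Acon-C parameters (`p1`, `p2`, `beta`, which are normally learned) are all assumptions chosen for illustration, and GeLU uses the common tanh approximation.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable identity: sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def gelu(x):
    # tanh approximation of GeLU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def mish(x):
    # x * tanh(softplus(x)); logaddexp(0, x) computes log(1 + exp(x)) stably.
    return x * np.tanh(np.logaddexp(0.0, x))

def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    # Acon-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    # Fixed scalars here (assumption); with p1=1, p2=0 it reduces to Swish.
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

ACTIVATIONS = {"GeLU": gelu, "Mish": mish, "Swish": swish, "Acon-C": acon_c}

def layer_norm(x, eps=1e-6):
    # Normalize over the last (channel) axis, without learned scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2, act):
    # Two-layer perceptron applied along the last axis.
    return act(x @ w1 + b1) @ w2 + b2

def init_params(rng, num_patches, channels, token_hidden, channel_hidden):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.02, (n_in, n_out)), np.zeros(n_out)
    tw1, tb1 = dense(num_patches, token_hidden)
    tw2, tb2 = dense(token_hidden, num_patches)
    cw1, cb1 = dense(channels, channel_hidden)
    cw2, cb2 = dense(channel_hidden, channels)
    return {"token": (tw1, tb1, tw2, tb2), "channel": (cw1, cb1, cw2, cb2)}

def mixer_block(x, params, act=gelu):
    # x has shape (num_patches, channels), e.g. embedded spectrogram patches.
    # Token mixing: transpose so the MLP mixes information across patches.
    y = mlp(layer_norm(x).T, *params["token"], act)
    x = x + y.T  # residual connection
    # Channel mixing: the MLP mixes information across channels per patch.
    y = mlp(layer_norm(x), *params["channel"], act)
    return x + y  # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))  # 16 patches, 32 channels (illustrative sizes)
params = init_params(rng, num_patches=16, channels=32,
                     token_hidden=8, channel_hidden=64)
for name, act in ACTIVATIONS.items():
    print(name, mixer_block(x, params, act).shape)
```

Passing the activation as a plain function argument keeps the block itself unchanged across the four variants, which mirrors how an activation comparison would hold the rest of the architecture fixed.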

ASM: Audio Spectrogram Mixer

Mixer is more than just a model

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM.

TCAMixer: A lightweight Mixer based on a novel triple concepts attention mechanism for NLP

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

AMixer: Adaptive Weight Mixing for Self-attention Free Vision Transformers.

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

AMPLIFY: Attention-based Mixup for Performance Improvement and Label Smoothing in Transformer

MAST: Multiscale Audio Spectrogram Transformers

TransMix: Attend to Mix for Vision Transformers

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

TS‐Mixer: A lightweight text representation model based on context awareness

Active Token Mixer

MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling

ChebMixer: Efficient Graph Representation Learning with MLP Mixer

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Environmental sound analysis with mixup based multitask learning and cross-task fusion

MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection