Abstract: Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the large gap between the domains of natural images and audio. This has motivated research on self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose the Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets a new state of the art in five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining.

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

FlexiAST: Flexibility is What AST Needs

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

MicroAST: Towards Super-Fast Ultra-Resolution Arbitrary Style Transfer

MAST: Multiscale Audio Spectrogram Transformers

Efficient Supervised Training of Audio Transformers for Music Representation Learning

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks

ASM: Audio Spectrogram Mixer

Streaming Audio Transformers for Online Audio Tagging

End-to-End ASR with Adaptive Span Self-Attention

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification

SpecTNT: a Time-Frequency Transformer for Music Audio

Asca: less audio data is more insightful

CAT: Causal Audio Transformer for Audio Classification

Depth-Aware Sparse Transformer for Video-Language Learning