Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems across various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. To recognize sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. To serve the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement this principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to perturbations of sounds. Meanwhile, the BYOL-A encoder combines local and global features and calculates their statistics so that the representation provides multi-aspect information. As a result, the learned representations should provide robust and multi-aspect information to serve the various needs of diverse tasks. We evaluated the general audio task performance of BYOL-A against previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4 and the best VoxCeleb1 result of 57.6. Extensive ablation experiments revealed that the BYOL-A encoder architecture accounts for most of the performance, while the BYOL framework and BYOL-A augmentations contribute the remaining critical portion. Our code is available online for future studies.
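The two ideas in the abstract, invariance to augmentations and statistics over local and global features, can be illustrated with a minimal sketch. It is not the authors' implementation. The loss and the moving-average target update follow the general BYOL formulation (normalized mean squared error between an online prediction and a target projection); the abstract does not state whether BYOL-A uses exactly this form. The function names, the `tau` value, and the choice of mean and max as the pooled statistics are assumptions made for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def byol_loss(online_prediction, target_projection):
    """Normalized MSE between two views, i.e. 2 - 2 * cosine similarity.

    The loss is 0 when both augmented views map to the same direction,
    which is what pushes the representation toward augmentation invariance.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights toward online weights (tau is an assumed value)."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


def temporal_stats(frames):
    """Pool per-frame features over time into [means..., maxes...].

    Mean and max are assumed statistics chosen for illustration.
    `frames` is a list of time frames, each a list of feature values.
    """
    per_dim = list(zip(*frames))
    means = [sum(d) / len(frames) for d in per_dim]
    maxes = [max(d) for d in per_dim]
    return means + maxes
```

For example, `byol_loss` ignores vector scale, so two views whose embeddings differ only in magnitude incur zero loss, while `temporal_stats` turns a variable-length sequence of frame features into a fixed-size vector.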

Audio ALBERT: A Lite BERT for Self-Supervised Learning of Audio Representation

[RE] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Sustainable self-supervised learning for speech representations

Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration

Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

MelHuBERT: A simplified HuBERT on Mel spectrograms

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

A Light Weight Model for Active Speaker Detection

Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks

SUPERB: Speech Understanding and PERformance Benchmark

Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification

Utilizing Self-supervised Representations for MOS Prediction

ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition