EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Leonardo Pepino, Pablo Riera, Luciana Ferrer

2024-05-21

Abstract: The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.
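The training objective described in the abstract, predicting discrete codec units for masked segments, can be illustrated with a small sketch. This is not the authors' implementation: the function name `masked_unit_loss`, the vocabulary size of 1024, and the choice to compute the loss on masked frames only are assumptions made for illustration, and the random integers merely stand in for EnCodec units.

```python
import numpy as np

def masked_unit_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (hypothetical helper).

    logits:  (T, V) float array, one score per discrete unit per frame
    targets: (T,) int array of discrete unit indices (stand-ins for codec units)
    mask:    (T,) bool array, True where the frame was masked
    """
    # Numerically stable log-softmax over the unit (vocabulary) axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target unit at every frame.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Assumption for this sketch: only masked frames contribute to the loss.
    return nll[mask].mean()

rng = np.random.default_rng(0)
T, V = 50, 1024                      # illustrative sequence length and vocabulary size
targets = rng.integers(0, V, size=T) # random stand-ins for discrete codec units
mask = rng.random(T) < 0.5
mask[0] = True                       # guarantee at least one masked frame
logits = rng.normal(size=(T, V))     # stand-in for model predictions
loss = masked_unit_loss(logits, targets, mask)
```

With uninformative (all-equal) logits the loss equals the log of the vocabulary size, which gives a simple baseline against which a trained model's loss can be compared.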

Sound, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

This paper proposes a new method for universal audio representation learning: obtaining a foundational model that can be applied to a variety of downstream tasks involving speech, music, and environmental sounds. To address this problem, the authors designed a model named EnCodecMAE. Its main contributions are as follows:

1. **Using a neural audio codec as the target**: The model uses the discrete units generated by EnCodec (a neural audio codec) as its training target, and learns universal audio representations by predicting them.
2. **Adopting a masked autoencoder architecture**: The model employs a Masked Autoencoder (MAE) architecture to process audio signals. Unlike approaches that replace masked embeddings with mask tokens, EnCodecMAE discards the masked embeddings, which improves training efficiency.
3. **Performance on various tasks**: The authors evaluated EnCodecMAE on a range of tasks involving speech, music, and environmental sounds, where it outperformed existing universal audio representation models in overall performance.
4. **Self-training phase**: The model is further improved through an additional self-training phase, in which it is trained on targets obtained from k-means clustering.
5. **Performance on automatic speech recognition tasks**: Although not the primary focus, the authors also reported EnCodecMAE's performance on Automatic Speech Recognition (ASR), showing that the model has some potential in this area.

In summary, EnCodecMAE aims to provide a universal audio representation model that performs well across audio-related tasks by combining a masked autoencoder with targets from the EnCodec codec.
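The efficiency point about discarding masked embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helpers `split_visible` and `restore_with_mask_token` are hypothetical names, the "encoder" is a trivial stand-in, and reinserting a mask token before the decoder follows the general MAE recipe rather than anything confirmed by this summary.

```python
import numpy as np

def split_visible(frames, mask):
    """Keep only the unmasked frames, plus their original positions."""
    visible_idx = np.flatnonzero(~mask)
    return frames[visible_idx], visible_idx

def restore_with_mask_token(encoded_visible, visible_idx, total_len, mask_token):
    """Rebuild a full-length sequence, filling masked slots with a mask token."""
    full = np.tile(mask_token, (total_len, 1))
    full[visible_idx] = encoded_visible
    return full

rng = np.random.default_rng(0)
T, D = 10, 4                                  # illustrative length and embedding size
frames = rng.normal(size=(T, D))              # stand-in for audio frame embeddings
mask = np.zeros(T, dtype=bool)
mask[rng.choice(T, size=6, replace=False)] = True  # mask 6 of the 10 frames

# The encoder only processes the 4 visible frames instead of all 10.
visible, visible_idx = split_visible(frames, mask)
encoded = visible * 2.0                       # trivial stand-in for an encoder

# Assumption: a mask token fills the masked positions before decoding.
mask_token = np.zeros(D)
decoder_input = restore_with_mask_token(encoded, visible_idx, T, mask_token)
```

Because the encoder sees only the visible frames, its cost scales with the number of unmasked positions rather than the full sequence length, which is where the training-efficiency gain comes from.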

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Audiovisual Masked Autoencoders

A-JEPA: Joint-Embedding Predictive Architecture Can Listen

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Contrastive Audio-Visual Masked Autoencoder

Enhancing Representation Learning of EEG Data with Masked Autoencoders

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement

CL-MAE: Curriculum-Learned Masked Autoencoders

MultiMAE: Multi-modal Multi-task Masked Autoencoders

VideoMAC: Video Masked Autoencoders Meet ConvNets

HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition

Learning Source Disentanglement in Neural Audio Codec

A vector quantized masked autoencoder for audiovisual speech emotion recognition

ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Masked Autoencoders Are Scalable Vision Learners