EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Leonardo Pepino, Pablo Riera, Luciana Ferrer
2024-05-21
Abstract: The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper proposes a new method for universal audio representation learning: obtaining a single foundational model that can be applied to a variety of downstream tasks involving speech, music, and environmental sounds. To that end, the authors introduce a model named EnCodecMAE. Its main contributions are as follows:

1. **Using a neural audio codec as the target**: The model is trained to predict the discrete units generated by EnCodec (a neural audio codec), and learns universal audio representations from this objective.
2. **Adopting a masked autoencoder architecture**: The model follows the Masked Autoencoder (MAE) design. Unlike BERT-style masking, the masked embeddings are discarded from the encoder input rather than replaced with mask tokens, so the encoder processes a shorter sequence and training is more efficient.
3. **Evaluation on a wide range of tasks**: The authors evaluated EnCodecMAE on tasks involving speech, music, and environmental sounds, where the best model outperformed existing universal audio representation models in overall performance.
4. **Self-training phase**: The model is further improved through an additional self-training phase, in which it is trained on targets obtained from k-means clustering.
5. **Automatic speech recognition**: Although not the primary focus, the authors also report results on Automatic Speech Recognition (ASR), which suggest that the representations have some potential in this area.

In summary, EnCodecMAE aims to provide a universal audio representation model that performs well across audio-related tasks by combining a masked autoencoder with targets derived from the EnCodec codec.
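
The masking and target-prediction idea summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`sample_span_mask`, `drop_masked`, `masked_cross_entropy`), the span-based masking scheme, the random integers standing in for EnCodec units, and the choice to score only masked positions are all assumptions made for illustration.

```python
import math
import random


def sample_span_mask(num_frames, mask_prob, span_len, rng):
    """Return a list of booleans; True marks a masked frame.

    Hypothetical masking scheme: contiguous spans of `span_len` frames
    are masked, starting at randomly chosen positions.
    """
    mask = [False] * num_frames
    num_starts = max(1, int(num_frames * mask_prob / span_len))
    for start in rng.sample(range(num_frames - span_len + 1), num_starts):
        for t in range(start, start + span_len):
            mask[t] = True
    return mask


def drop_masked(frames, mask):
    """MAE-style step: keep only the visible (unmasked) frames."""
    return [frame for frame, masked in zip(frames, mask) if not masked]


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy between per-frame logits and integer targets.

    Assumption of this sketch: only masked positions contribute.
    """
    total, count = 0.0, 0
    for frame_logits, target, masked in zip(logits, targets, mask):
        if not masked:
            continue
        peak = max(frame_logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count


rng = random.Random(0)
num_frames, codebook_size, dim = 20, 8, 4

# Stand-ins: random frame embeddings and random codec unit indices.
frames = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]
codes = [rng.randrange(codebook_size) for _ in range(num_frames)]

mask = sample_span_mask(num_frames, mask_prob=0.5, span_len=5, rng=rng)
visible = drop_masked(frames, mask)  # shorter sequence seen by the encoder

# An untrained predictor with uniform logits scores log(codebook_size).
uniform_logits = [[0.0] * codebook_size for _ in range(num_frames)]
loss = masked_cross_entropy(uniform_logits, codes, mask)
```

Because the masked frames are dropped, `visible` is shorter than `frames`, which is the source of the efficiency gain described in point 2. In a real model, the logits would come from a decoder that receives the encoder outputs, and the targets would be actual codec units rather than random integers.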