An Independence-promoting Loss for Music Generation with Language Models

Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez
2024-06-10
Abstract: Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens; therefore, the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the statistical dependence between the audio codebooks produced by multi-stage quantizers in language-model-based music generation. Specifically:

1. **Problem Background**:
   - To generate music with a language model, audio signals are first encoded into discrete audio tokens. These tokens are codes in a discrete latent space learned by an auto-encoder.
   - Multi-stage quantizers produce these tokens, with each stage using a different codebook. The decoding strategy must therefore handle multiple codebooks: it either models the joint distribution over all codebooks or fits the product of the codebook marginal distributions.
2. **Limitations of Existing Methods**:
   - Modeling the joint distribution requires many more autoregressive steps, which makes generation computationally expensive.
   - Fitting the product of the marginal distributions keeps training and inference fast, but the model is exact only when the codebooks are mutually independent. In practice the codebooks are often dependent, which can degrade generation quality.
3. **Solution Proposed in the Paper**:
   - The authors introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer for the music-generation language model.
   - The loss is a proxy for mutual information based on the Maximum Mean Discrepancy (MMD) principle, applied in reproducing kernel Hilbert spaces.
   - Optimizing this loss reduces the statistical dependence between codebooks. This improves the quality of music generated with the product-of-marginals model while keeping its fast generation speed.
4. **Main Contributions**:
   - The paper verifies that MMD in reproducing kernel Hilbert spaces is a reasonable proxy for measuring independence, and that optimizing this criterion reduces the mutual information between codebooks during auto-encoding.
   - It proposes a variant of the loss that matches the decoding strategy used for token prediction. This variant performs best when applied to the "delayed" strategy.
   - Experiments show that the language model trained with the proposed independence loss outperforms the baseline models on objective and subjective music-generation quality scores, with the same number of parameters and the same generation speed.

In this way, the paper reduces the dependence between codebooks and improves generation quality without giving up the speed of the product-of-marginals model.
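The idea of using MMD as an independence proxy can be illustrated with a small NumPy sketch. It is not the authors' implementation: the Gaussian kernel, the fixed bandwidth, the biased estimator, the toy one-dimensional streams, and the names `gaussian_kernel`, `mmd2`, and `independence_proxy` are all assumptions made for illustration. The sketch compares samples of the joint distribution of two streams with samples that approximate the product of their marginals, obtained here by shuffling one stream across the batch.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd2(p, q, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets p and q."""
    return (
        gaussian_kernel(p, p, bandwidth).mean()
        + gaussian_kernel(q, q, bandwidth).mean()
        - 2.0 * gaussian_kernel(p, q, bandwidth).mean()
    )


def independence_proxy(x, y, rng, bandwidth=1.0):
    """MMD between joint samples (x_i, y_i) and shuffled pairs (x_i, y_perm(i)).

    Shuffling y across the batch breaks the pairing with x, so the shuffled
    pairs approximate samples from the product of the marginals (assumption
    of this sketch).
    """
    joint = np.concatenate([x, y], axis=1)
    perm = rng.permutation(len(y))
    product = np.concatenate([x, y[perm]], axis=1)
    return mmd2(joint, product, bandwidth)


# Toy data standing in for two codebook streams (hypothetical, not real codec latents).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 1))
y_dependent = x + 0.1 * rng.normal(size=(512, 1))  # strongly tied to x
y_independent = rng.normal(size=(512, 1))          # drawn independently of x

dep_score = independence_proxy(x, y_dependent, rng)
ind_score = independence_proxy(x, y_independent, rng)
print(f"dependent streams:   {dep_score:.4f}")
print(f"independent streams: {ind_score:.4f}")
```

The dependent pair should give the larger score, and a value near zero suggests that the joint distribution is close to the product of the marginals. A training loss along these lines would need a differentiable framework and a choice of which quantizer outputs to compare, neither of which this sketch covers.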