Abstract: Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, therefore the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.

What problem does this paper attempt to address?

This paper addresses the statistical dependence between the audio codebooks produced by multi-stage quantizers in language-model-based music generation. Specifically:

1. **Problem Background**:
   - Generating music with a language model usually requires encoding the audio signal into discrete audio tokens, which are codes in a discrete latent space learned by an auto-encoder.
   - Multi-stage quantizers are used to produce these tokens, with each stage using a different codebook. The decoding strategy must therefore handle multiple codebooks: it either models the joint distribution over all codebooks or fits the product of the codebook marginal distributions.
2. **Limitations of Existing Methods**:
   - Modeling the joint distribution requires many more autoregressive steps, which makes generation computationally expensive.
   - Fitting the product of the marginal distributions simplifies training and inference, but the model is exact only when the codebooks are mutually independent. In practice the codebooks are often dependent, which degrades generation quality.
3. **Solution Proposed in the Paper**:
   - The authors introduce an independence-promoting loss to regularize the auto-encoder that serves as the tokenizer for the music-generation language model.
   - The loss is based on the maximum mean discrepancy (MMD) principle, applied in reproducing kernel Hilbert spaces, and acts as a proxy for mutual information.
   - Optimizing this loss reduces the statistical dependence between codebooks, which improves the quality of music generated with the product-of-marginals model while keeping its fast generation speed.
4. **Main Contributions**:
   - The paper verifies that MMD in reproducing kernel Hilbert spaces is a reasonable proxy for measuring independence, and that optimizing this criterion reduces the mutual information between codebooks during auto-encoding.
   - It proposes a variant of the loss that matches the decoding strategy used for token prediction; this variant performs best when applied to the "delayed" strategy.
   - Experiments show that the language model trained with the proposed independence loss outperforms the baseline models on objective and subjective music-generation quality scores, with the same number of parameters and the same generation speed.

In short, the method reduces the dependence between codebooks and improves generation quality without giving up the speed of the product-of-marginals model.
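The idea of using MMD as an independence proxy can be illustrated with a small sketch. This is not the authors' implementation: the Gaussian kernel, the fixed `bandwidth`, the biased estimator, and the batch-shuffling trick used to draw samples from the product of the marginals are all assumptions made for illustration, and the names `gaussian_kernel`, `mmd2`, and `independence_penalty` are hypothetical. The sketch uses NumPy, so it only evaluates the criterion; a training loss would need the same computation in a differentiable framework.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x (n, d) and y (m, d)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


def independence_penalty(streams, rng, bandwidth=1.0):
    """MMD^2 between joint samples and approximate product-of-marginals samples.

    streams: list of (batch, dim) arrays, one per codebook (assumed layout).
    Shuffling every stream but the first along the batch axis breaks the
    pairing between codebooks while preserving each marginal.
    """
    joint = np.concatenate(streams, axis=1)
    shuffled = [streams[0]] + [s[rng.permutation(len(s))] for s in streams[1:]]
    product = np.concatenate(shuffled, axis=1)
    return mmd2(joint, product, bandwidth)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z1 = rng.standard_normal((512, 2))
    dependent = [z1, z1 + 0.1 * rng.standard_normal((512, 2))]
    independent = [z1, rng.standard_normal((512, 2))]
    print("dependent codebooks:  ", independence_penalty(dependent, rng))
    print("independent codebooks:", independence_penalty(independent, rng))
```

When the two streams are strongly coupled, the joint samples differ visibly from the shuffled ones and the penalty is large; when they are independent, shuffling changes nothing in distribution and the penalty stays close to zero. Minimizing such a quantity during auto-encoder training is what pushes the codebooks toward independence.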

An Independence-promoting Loss for Music Generation with Language Models

Audio Conditioning for Music Generation via Discrete Bottleneck Features

Simple and Controllable Music Generation

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

Generating Stereophonic Music with Single-Stage Language Models

AudioLM: a Language Modeling Approach to Audio Generation

Using Random Codebooks for Audio Neural AutoEncoders

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Audio Language Modeling using Perceptually-Guided Discrete Representations

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Multi-Source Music Generation with Latent Diffusion

Impromptu Accompaniment of Pop Music Using Coupled Latent Variable Model with Binary Regularizer

Learning Source Disentanglement in Neural Audio Codec

Generative De-Quantization for Neural Speech Codec via Latent Diffusion

Modulated Variational auto-Encoders for many-to-many musical timbre transfer

Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders

Music Generation based on Generative Adversarial Networks with Transformer

MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models

Efficient Neural Music Generation

Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

Music Generation System for Adversarial Training Based on Deep Learning