Abstract: In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.

What problem does this paper attempt to address?

The main problem this paper addresses is how to automatically generate sample-based musical instruments from text or reference audio prompts while keeping the timbre of each generated instrument consistent. Specifically, the authors propose and study the use of neural audio codec language models to achieve this goal.

### Main Problems and Challenges

1. **Timbral Consistency**:
   - The generated instruments need to maintain a consistent timbre across different pitches and velocities. This is a key challenge in generating high-quality, playable instruments.
   - The paper introduces three different conditioning schemes to address this challenge and develops a new objective metric to evaluate the timbral consistency of the generated instruments.
2. **Text-to-Instrument (T2I) Task**:
   - A new task is proposed: generating instrument waveforms from text prompts provided by users. This requires the model to understand text semantics and convert them into corresponding audio features.
   - The Contrastive Language-Audio Pretraining (CLAP) score is adapted for the T2I task, because applying it directly is not suitable for evaluating this task.
3. **Diversity and Controllability of the Generative Model**:
   - The model needs not only to generate realistic instrument samples but also to offer fine-grained control according to input prompts (text or audio), to meet practical needs in music production.

### Solution Overview

- **Neural Audio Codec Language Model**: The authors extend a generative audio framework so that it can perform conditional generation according to pitch, velocity, and combined text/audio embeddings.
- **Three Conditioning Schemes**:
  - **Baseline CLAP**: Directly use CLAP embeddings for conditioning.
  - **Random CLAP**: Randomly select CLAP embeddings at different pitches and velocities to reduce dependence on specific pitches and velocities.
  - **Fixed CLAP**: Use fixed CLAP embeddings for each instrument to ensure the consistency of the generated samples during training and inference.
- **New Metric**: Introduce the TC (Timbral Consistency) score to quantify the timbral consistency of the generated instruments.
- **Improved CLAP Score**: Propose an adaptation of the average CLAP score to make it more suitable for the T2I task.

### Experimental Results

Through objective and subjective evaluations, the authors show that the proposed model can generate high-quality and timbrally consistent instrument samples in both the Sample-to-Instrument (S2I) and T2I tasks. In particular, the fixed CLAP variant performs best in terms of timbral consistency, while the random CLAP variant performs well in terms of overall expressiveness and fidelity.

In conclusion, this paper aims to solve the problem of automatically generating high-quality, timbrally consistent instrument samples from text or audio prompts through neural audio codec language models. To that end, it introduces three conditioning schemes, a timbral consistency metric, and an adapted CLAP score.
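As a rough illustration of how the three conditioning schemes differ, the following minimal sketch selects which embedding conditions each note of an instrument. It is an interpretation under stated assumptions, not the authors' implementation: the helper names (`make_toy_instrument`, `select_conditioning`), the toy random vectors standing in for CLAP embeddings, the mapping of the 88 keys to MIDI notes 21 to 108, the three velocity layers, and the choice of which sample serves as the "fixed" embedding are all hypothetical.

```python
import random

# Hypothetical toy data: one instrument = {(midi_pitch, velocity): embedding}.
# Real CLAP embeddings would come from a CLAP encoder; these random vectors
# are stand-ins used only to show the selection logic.
def make_toy_instrument(pitches, velocities, dim=4, seed=0):
    rng = random.Random(seed)
    return {
        (p, v): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for p in pitches
        for v in velocities
    }

def select_conditioning(instrument, pitch, velocity, scheme, rng=None):
    """Pick the embedding that conditions generation of the note (pitch, velocity)."""
    if scheme == "baseline":
        # Assumed reading: condition on the embedding of the target sample itself.
        return instrument[(pitch, velocity)]
    if scheme == "random":
        # Assumed reading: condition on a randomly chosen sample of the same
        # instrument, so the embedding is decoupled from pitch and velocity.
        rng = rng or random.Random()
        return instrument[rng.choice(sorted(instrument))]
    if scheme == "fixed":
        # Assumed reading: one embedding per instrument, reused for every note.
        # Using the lowest (pitch, velocity) key is an arbitrary choice here.
        return instrument[min(instrument)]
    raise ValueError(f"unknown scheme: {scheme!r}")

# 88 keys assumed to map to MIDI notes 21..108; velocity layers are made up.
instrument = make_toy_instrument(pitches=range(21, 109), velocities=(40, 80, 120))
notes = [(21, 40), (60, 80), (108, 120)]

baseline = [select_conditioning(instrument, p, v, "baseline") for p, v in notes]
fixed = [select_conditioning(instrument, p, v, "fixed") for p, v in notes]
sampled = select_conditioning(instrument, 60, 80, "random", rng=random.Random(1))
```

In this sketch, the fixed scheme hands the same embedding to every note of the instrument, whereas the baseline scheme lets the embedding vary with pitch and velocity. That difference is consistent with the summary's observation that the fixed CLAP variant scores best on timbral consistency.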

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
