Abstract: Music creation typically consists of two parts: composing the musical score, and then performing the score with instruments to produce sound. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences, since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns, in an end-to-end manner, the score-to-audio mapping between a symbolic representation of music called the pianoroll and an audio representation of music called the spectrogram. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between pianorolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the results of a user study showing that our model achieves a higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two off-the-shelf synthesizers. Our source code is available at https://github.com/bwang514/PerformanceNet
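
To make the input representation concrete, the following is a minimal sketch of how a list of notes could be turned into a binary pianoroll (a pitch-by-time matrix). It is an illustration only and is not taken from the released code: the function name `notes_to_pianoroll`, the `(pitch, onset, offset)` note format, and the frame rate are all assumptions, and the preprocessing actually used for the model may differ.

```python
from typing import List, Tuple

# Hypothetical note format: (MIDI pitch 0-127, onset in seconds, offset in seconds).
Note = Tuple[int, float, float]


def notes_to_pianoroll(
    notes: List[Note],
    frames_per_second: int = 100,  # assumed time resolution, not from the paper
    n_pitches: int = 128,          # size of the MIDI pitch range
) -> List[List[int]]:
    """Return a binary matrix where roll[p][t] == 1 while pitch p is sounding."""
    if not notes:
        return [[] for _ in range(n_pitches)]

    # The clip length is set by the latest note offset.
    n_frames = max(round(offset * frames_per_second) for _, _, offset in notes)
    roll = [[0] * n_frames for _ in range(n_pitches)]

    for pitch, onset, offset in notes:
        start = round(onset * frames_per_second)
        end = round(offset * frames_per_second)
        for t in range(start, end):
            roll[pitch][t] = 1
    return roll


# Hypothetical two-note excerpt: A4 (MIDI 69) for 0.5 s, then C5 (MIDI 72) for 0.5 s.
example_notes = [(69, 0.0, 0.5), (72, 0.5, 1.0)]
roll = notes_to_pianoroll(example_notes, frames_per_second=10)
```

A matrix like this carries only which pitches sound and when. The timing nuances, dynamics, and timbre that the abstract describes as missing from the score are what a score-to-audio model has to supply when predicting the spectrogram.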

Hierarchical Timbre-Painting and Articulation Generation

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

Introducing Latent Timbre Synthesis

Efficient Neural Music Generation

TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

Vector-Quantized Timbre Representation

Hyperbolic Timbre Embedding for Musical Instrument Sound Synthesis Based on Variational Autoencoders

DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks

Msanii: High Fidelity Music Synthesis on a Shoestring Budget

Music Generation Using Dual Interactive Wasserstein Fourier Acquisitive Generative Adversarial Network

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

Presto! Distilling Steps and Layers for Accelerating Music Generation