Abstract: Music creation typically consists of two parts: composing the musical score, and then performing the score with instruments to make sounds. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences, since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns, in an end-to-end manner, the score-to-audio mapping between a symbolic representation of music called the pianoroll and an audio representation of music called the spectrogram. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between pianorolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the result of a user study showing that our model achieves a higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two off-the-shelf synthesizers. Our source code is available at https://github.com/bwang514/PerformanceNet
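
To make the input representation concrete, the following is a minimal sketch of how a pianoroll can be built from a list of notes. It is not taken from the PerformanceNet code: the function name `notes_to_pianoroll`, the `(pitch, onset, duration)` note format, and the frame rate of 10 frames per second are all assumptions chosen for illustration. The point it shows is that a pianoroll is a binary pitch-by-time matrix that records which notes sound when, can hold several simultaneous notes, and carries no information about dynamics or timbre.

```python
# Illustrative sketch only; names and parameters are hypothetical and do not
# come from the PerformanceNet repository.
# Assumed note format: (MIDI pitch, onset in seconds, duration in seconds).

def notes_to_pianoroll(notes, frames_per_second=10, n_pitches=128):
    """Return a binary pianoroll as a list of rows, one row per MIDI pitch."""
    if not notes:
        return [[] for _ in range(n_pitches)]
    # The roll is as long as the latest note offset.
    end_time = max(onset + duration for _, onset, duration in notes)
    n_frames = int(round(end_time * frames_per_second))
    roll = [[0] * n_frames for _ in range(n_pitches)]
    for pitch, onset, duration in notes:
        start = int(round(onset * frames_per_second))
        stop = int(round((onset + duration) * frames_per_second))
        # Mark every frame in which this pitch is sounding.
        for t in range(start, min(stop, n_frames)):
            roll[pitch][t] = 1
    return roll


# A C major triad whose top note enters half a second late.
notes = [(60, 0.0, 1.0), (64, 0.0, 1.0), (67, 0.5, 1.0)]
roll = notes_to_pianoroll(notes)

# Number of pitches sounding in each frame (the polyphony over time).
polyphony = [sum(row[t] for row in roll) for t in range(len(roll[0]))]
```

A score-to-audio model of the kind described in the abstract would take a matrix like `roll` as input and predict a spectrogram over the same time span; everything a performer adds (timing nuance, dynamics, overtones) has to be learned, because the matrix itself only contains zeros and ones.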

SynthNet: Learning to Synthesize Music End-to-End

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

SING: Symbol-to-Instrument Neural Generator

GANSynth: Adversarial Neural Audio Synthesis

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Conditioning Deep Generative Raw Audio Models for Structured Automatic Music

MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation

Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Demonstration of PerformanceNet: A Convolutional Neural Network Model for Score-to-Audio Music Generation

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

A holistic approach to polyphonic music transcription with neural networks

NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation

Piano automatic transcription based on transformer

StemGen: A music generation model that listens

MelNet: A Generative Model for Audio in the Frequency Domain

DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

FloWaveNet : A Generative Flow for Raw Audio

From Music Scores to Audio Recordings: Deep Pitch-Class Representations for Measuring Tonal Structures