Abstract: Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, which prevents audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAEs) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments and to synthesize audio from any point of the space. We introduce a specific regularization that enforces any given set of similarity distances on these spaces. We show that the resulting spaces provide distance relationships closely similar to those of timbre spaces. We evaluate several spectral transforms and show that the Non-Stationary Gabor Transform (NSGT) provides the highest correlation to timbre spaces and the best quality of synthesis. Furthermore, we show that these spaces can generalize to novel instruments and can generate any path between instruments to understand their timbre relationships. As these spaces are continuous, we study how audio descriptors behave along the latent dimensions. We show that even though descriptors have an overall non-linear topology, they follow a locally smooth evolution. Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.
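The idea of regularizing a latent space with perceptual ratings can be illustrated with a minimal sketch. It is not the formulation used in the paper: it simply assumes that the penalty is the mean squared difference between normalized pairwise latent distances and normalized perceptual dissimilarities, added to the usual VAE objective. The function names, the weights `alpha` and `beta`, and the max-normalization are all hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_distances(z):
    """Euclidean distance between every pair of latent points (rows of z)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def perceptual_regularizer(z, target):
    """Hypothetical penalty: mismatch between latent and perceptual distances.

    z      : (n, d) array, one latent point per instrument
    target : (n, n) symmetric array of perceptual dissimilarity ratings
    Both distance matrices are scaled to a unit maximum (an assumption),
    so only the relative organization of the space is compared.
    """
    d = pairwise_distances(z)
    d = d / (d.max() + 1e-12)
    t = target / (target.max() + 1e-12)
    upper = np.triu_indices(len(z), k=1)  # each unordered pair once
    return float(np.mean((d[upper] - t[upper]) ** 2))

def regularized_loss(recon, kl, z, target, beta=1.0, alpha=1.0):
    """Assumed total objective: reconstruction + beta * KL + alpha * penalty."""
    return recon + beta * kl + alpha * perceptual_regularizer(z, target)

# Toy usage: three instruments in a 2-D latent space.
z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
matching = 3.0 * pairwise_distances(z)   # same geometry, different scale
mismatched = np.ones((3, 3)) - np.eye(3) # all instruments equally dissimilar
print(perceptual_regularizer(z, matching))
print(perceptual_regularizer(z, mismatched))
```

Under these assumptions the penalty vanishes when the latent geometry matches the ratings up to a global scale, and grows as the two organizations diverge. In a real training loop the same computation would be written in a differentiable framework so that it can shape the encoder.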

Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics

Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization

MG-VAE: Deep Chinese Folk Songs Generation with Specific Regional Style