Abstract: Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, which prevents audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAEs) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments and to synthesize audio from any point of this space. We introduce a specific regularization that enforces any given set of similarity distances on these spaces. We show that the resulting spaces provide nearly the same distance relationships as timbre spaces. We evaluate several spectral transforms and show that the Non-Stationary Gabor Transform (NSGT) provides the highest correlation to timbre spaces and the best synthesis quality. Furthermore, we show that these spaces can generalize to novel instruments and can generate any path between instruments to understand their timbre relationships. As these spaces are continuous, we study how audio descriptors behave along the latent dimensions. We show that even though descriptors have an overall non-linear topology, they follow a locally smooth evolution. Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.
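To make the idea of enforcing similarity distances on a latent space concrete, the following is a minimal sketch in NumPy, not the authors' implementation. The abstract does not state the form of the regularizer, so the squared-error penalty, the Euclidean latent distance, and the names `pairwise_distances` and `perceptual_regularizer` are all assumptions made for illustration. The sketch only computes a penalty that is zero when distances between latent codes match a given dissimilarity matrix and grows as they diverge.

```python
import numpy as np

def pairwise_distances(z):
    """Euclidean distance between every pair of rows in z (shape: n x d)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def perceptual_regularizer(z, target):
    """Mean squared gap between latent distances and target dissimilarities.

    Assumed form, for illustration only.
    z      : (n, d) latent codes, one per sound (hypothetical input)
    target : (n, n) symmetric dissimilarity matrix, assumed to be rescaled
             to the same range as the latent distances
    """
    d_latent = pairwise_distances(z)
    # Compare each unordered pair once (upper triangle, diagonal excluded).
    iu = np.triu_indices(len(z), k=1)
    return float(np.mean((d_latent[iu] - target[iu]) ** 2))

# Toy latent codes for three sounds (made-up values).
z = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# A target that matches the latent geometry exactly gives no penalty.
target = pairwise_distances(z)
penalty_matched = perceptual_regularizer(z, target)

# Shifting every target dissimilarity by 1 gives a nonzero penalty.
penalty_mismatch = perceptual_regularizer(z, target + 1.0)
```

In a training setting, a term of this kind would typically be added to the usual VAE objective with a weighting factor; that weighting and the choice of distance are design decisions this sketch does not take from the source.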

Introducing Latent Timbre Synthesis

Sound Design Strategies for Latent Audio Space Explorations Using Deep Learning Architectures

Interpretable Timbre Synthesis using Variational Autoencoders Regularized on Timbre Descriptors

Combining audio control and style transfer using latent diffusion

Vector-Quantized Timbre Representation

Hierarchical Timbre-Painting and Articulation Generation

Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks

Conditioning Autoencoder Latent Spaces for Real-Time Timbre Interpolation and Synthesis

Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics

Vocal Timbre Effects with Differentiable Digital Signal Processing

Hyperbolic Timbre Embedding for Musical Instrument Sound Synthesis Based on Variational Autoencoders

Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

Audio Latent Space Cartography

Real-time Timbre Remapping with Differentiable DSP

Bass Accompaniment Generation via Latent Diffusion

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models

Flat Latent Manifolds for Human-machine Co-creation of Music

Long-form music generation with latent diffusion

User-Driven Voice Generation and Editing through Latent Space Navigation

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models