Abstract: Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector, which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks.
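
The idea of probabilistically skipping discriminator updates can be illustrated with a minimal sketch. The abstract does not state how the skip probability is set, so the controller below (the `overfit_signal` input, the `target` threshold, and the `step` size) is a hypothetical stand-in rather than the authors' rule, and all function names are illustrative only.

```python
import random


def should_skip_discriminator_update(p_skip: float, rng: random.Random) -> bool:
    """Return True if this training step should skip the discriminator update."""
    return rng.random() < p_skip


def adjust_skip_probability(p_skip: float, overfit_signal: float,
                            target: float = 0.6, step: float = 0.01) -> float:
    """Hypothetical controller (an assumption, not taken from the abstract):
    raise p_skip when the discriminator looks too strong, lower it otherwise,
    and clamp the result to [0, 1]."""
    if overfit_signal > target:
        p_skip += step
    else:
        p_skip -= step
    return min(max(p_skip, 0.0), 1.0)


def count_discriminator_updates(num_steps: int, p_skip: float, seed: int = 0) -> int:
    """Count how many of num_steps would still update the discriminator."""
    rng = random.Random(seed)
    updates = 0
    for _ in range(num_steps):
        if not should_skip_discriminator_update(p_skip, rng):
            updates += 1  # a real training loop would run the discriminator step here
    return updates
```

With `p_skip = 0.0` every step updates the discriminator, and with `p_skip = 1.0` none do. Intermediate values thin out the updates at random, which is the behaviour the abstract describes at a high level.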

Learning Disentangled Audio Representations through Controlled Synthesis

Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio

Contrastive Learning from Synthetic Audio Doppelgangers

dMelodies: A Music Dataset for Disentanglement Learning

DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

Disentanglement in a GAN for Unconditional Speech Synthesis

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Polyphonic training set synthesis improves self-supervised urban sound classification

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

Compositional Audio Representation Learning

ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis