Abstract: In this paper, we propose a novel unsupervised text-to-speech (TTS) acoustic model (AM) training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning and is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as a condition during training. For UTTS inference, we use a lexicon to map input text to a phoneme sequence, which is expanded to a frame-level forced alignment (FA) with a speaker-dependent duration model. We then develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to a waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus in the AM development stage. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility, as measured by human and objective evaluations. Audio samples are available at our demo page: https://neurtts.github.io/utts_demo/.
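The first two inference steps described in the abstract (lexicon lookup, then expansion of the phoneme sequence to a frame-level alignment) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the fixed per-phoneme frame counts, and the function names `text_to_phonemes` and `expand_to_frames` are all hypothetical, and a fixed lookup table stands in for the speaker-dependent duration model, which would predict durations rather than look them up.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> phoneme sequence (illustrative only).
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical stand-in for a speaker-dependent duration model:
# a fixed number of frames per phoneme.
DURATIONS: Dict[str, int] = {
    "HH": 2, "AH": 3, "L": 2, "OW": 4, "W": 2, "ER": 3, "D": 2,
}


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Map whitespace-separated words to a flat phoneme sequence via the lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(lexicon[word])  # raises KeyError for out-of-lexicon words
    return phonemes


def expand_to_frames(phonemes: List[str], durations: Dict[str, int]) -> List[str]:
    """Repeat each phoneme label for its duration to get a frame-level alignment."""
    frames: List[str] = []
    for phoneme in phonemes:
        frames.extend([phoneme] * durations[phoneme])
    return frames


phonemes = text_to_phonemes("hello world", LEXICON)
frames = expand_to_frames(phonemes, DURATIONS)
```

In the pipeline the abstract describes, a frame-level sequence of this kind would then be passed to the alignment mapping module that converts FA to UA; that module, the C-DSVAE, and the vocoder are learned models and are not sketched here.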

Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder

Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

Language Transfer of Audio Word2Vec: Learning Audio Segment Representations Without Target Language Data

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

AVSegFormer: Audio-Visual Segmentation with Transformer

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

Wnet: Audio-Guided Video Object Segmentation Via Wavelet-Based Cross-Modal Denoising Networks

Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding