Abstract: Many speech technologies include a speech generation stage, such as text-to-speech (TTS), voice conversion (VC), and speech enhancement (SE). Recent advances in deep learning based methods have significantly improved the performance of these technologies [1, 2, 3, 4, 5, 6, 7, 8]. Although various successful deep learning based speech processing methods have been proposed, most systems can perform only one task. For each problem, the network architecture is designed for the targeted task only and requires a long period of task-specific tuning. This procedure has to be repeated for every new task, which restricts how fully the power of neural networks can be exploited. The question is whether we can create a unified deep learning model that solves tasks across multiple speech technologies. We observe that the theoretical differences between these technologies are becoming much smaller than their original narrow definitions suggest. To give a few examples, recent high-performance VC systems benefit from the phone posteriorgram (that is, a continuous phone representation) of the input speech [9]. There has also been an attempt to use both spectral features and the phone posteriorgram to further improve voice conversion performance [4]. Similar trends can be seen in TTS. End-to-end TTS systems sometimes use phone-embedding vectors instead of letters as the input [3, 10]. There has also been an attempt to give Tacotron a reference audio signal as an additional input, so that the prosody of the reference audio is transferred into the synthetic speech via a reference encoder [11]. Given these trends, we strongly believe that a single model can be shared across multiple tasks. We assume that speech generation tasks can be divided into two parts: an input encoder and an acoustic decoder. What differs among the tasks is the input. For example, the input of TTS is text characters, while that of VC and SE is acoustic features. The model can be thought of as an encoder-decoder model that supports multiple encoders. The role of the multiple encoder networks is the front-end processing of each type of input data, and the role of the decoder network is to predict the acoustic features required for waveform generation. Our initial work starts with a jointly trained model for TTS and VC [12].
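
The multi-encoder, single-decoder structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the dimensions (`HIDDEN`, `N_MELS`, `VOCAB`), the random untrained weights, and the frame-by-frame linear decoder are all assumptions made for illustration. A real system would use trained neural encoders and an attention-based decoder, so the output length would not need to match the input length.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
HIDDEN = 8   # size of the shared hidden representation
N_MELS = 4   # dimension of the acoustic features predicted by the decoder
VOCAB = 30   # number of distinct text symbols


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TextEncoder:
    """Front end for TTS: maps character ids to hidden vectors."""

    def __init__(self, rng):
        self.embedding = make_matrix(VOCAB, HIDDEN, rng)

    def __call__(self, char_ids):
        return [self.embedding[i] for i in char_ids]


class SpeechEncoder:
    """Front end for VC: maps acoustic frames to hidden vectors."""

    def __init__(self, rng):
        self.proj = make_matrix(HIDDEN, N_MELS, rng)

    def __call__(self, frames):
        return [matvec(self.proj, frame) for frame in frames]


class AcousticDecoder:
    """Shared back end: predicts acoustic features from hidden vectors."""

    def __init__(self, rng):
        self.proj = make_matrix(N_MELS, HIDDEN, rng)

    def __call__(self, hidden_seq):
        return [matvec(self.proj, h) for h in hidden_seq]


class SharedSpeechModel:
    """One decoder shared by several task-specific encoders."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.encoders = {"tts": TextEncoder(rng), "vc": SpeechEncoder(rng)}
        self.decoder = AcousticDecoder(rng)

    def generate(self, task, inputs):
        hidden = self.encoders[task](inputs)  # task-specific front end
        return self.decoder(hidden)           # task-independent back end


model = SharedSpeechModel()
tts_out = model.generate("tts", [3, 7, 1])                      # character ids
vc_out = model.generate("vc", [[0.1] * N_MELS for _ in range(5)])  # acoustic frames
```

Under this design, supporting a further input type such as SE would only mean registering another encoder in `encoders`; the decoder and its parameters stay shared across tasks.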

Shared model for multi-source speech generation tasks

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Enhancing CTC-based speech recognition with diverse modeling units

SpeechNet: A Universal Modularized Model for Speech Processing Tasks

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

MASS: Multi-task anthropomorphic speech synthesis framework

Understanding Shared Speech-Text Representations

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

STTATTS: Unified Speech-To-Text And Text-To-Speech Model

Building Multi lingual TTS using Cross Lingual Voice Conversion

Shared Network for Speech Enhancement Based on Multi-Task Learning.

Hierarchical Generative Modeling for Controllable Speech Synthesis

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Multi-Target Emotional Voice Conversion With Neural Vocoders

Joint Speech-Text Embeddings for Multitask Speech Processing

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora