Shared model for multi-source speech generation tasks

Mingyang Zhang
2019-01-01
Abstract: Many speech technologies, such as text-to-speech (TTS), voice conversion (VC), and speech enhancement (SE), include a speech generation stage. Recent advances in deep learning based methods have significantly improved the performance of these technologies [1, 2, 3, 4, 5, 6, 7, 8]. Although various successful deep learning based speech processing methods have been proposed, most systems can perform only one task. For each problem, the network architecture is designed for the targeted task only and requires a long period of task-specific tuning. This procedure must be repeated for each task, which restricts the power of the neural network. The question is whether we can create a unified deep learning model that solves tasks across multiple speech technologies. We observe that the theoretical differences between these technologies are becoming much smaller than their original narrow definitions suggest. To give a few examples, recent high-performance VC systems benefit from using the phone posteriorgram (that is, a continuous phone representation) of the input speech [9]. There was also an attempt to use both spectral features and the phone posteriorgram to further improve voice conversion performance [4]. Similar trends can be seen for TTS. End-to-end TTS systems sometimes use phone-embedding vectors as input instead of letters [3, 10]. There was also an attempt to feed a reference audio signal as an additional input to Tacotron, transferring the prosody of the reference audio into the synthetic speech via a reference encoder [11]. Given these trends, we strongly believe that one model can be shared across multiple tasks. We assume that speech generation tasks can be divided into two parts: an input encoder and an acoustic decoder. The difference among the tasks is the input.
For example, the input of TTS is text characters, while that of VC and SE is acoustic features. The model can be thought of as an encoder-decoder model that supports multiple encoders. The role of the multiple encoder networks is the front-end processing of each type of input data, and the role of the decoder network is to predict the acoustic features required for waveform generation. Our initial work starts with a joint training model for TTS and VC [12].
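The multi-encoder, shared-decoder layout described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, dimensions, and random linear layers are hypothetical placeholders, and the frame-by-frame decoder is a deliberate simplification (a real system would likely use an attention-based decoder whose output length differs from the input length). It only shows how task-specific encoders can map different input types into one hidden space consumed by a single decoder.

```python
import random

HIDDEN_DIM = 8  # hypothetical size of the shared hidden representation
MEL_DIM = 4     # hypothetical size of one acoustic feature frame


def make_linear(in_dim, out_dim, rng):
    """Create a random weight matrix with out_dim rows of in_dim values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply a weight matrix by a vector (no bias term)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class TextEncoder:
    """Front end for TTS: maps character ids to hidden vectors by table lookup."""

    def __init__(self, vocab_size, rng):
        self.embedding = make_linear(HIDDEN_DIM, vocab_size, rng)

    def __call__(self, char_ids):
        return [self.embedding[i] for i in char_ids]


class AcousticEncoder:
    """Front end for VC or SE: projects acoustic frames to hidden vectors."""

    def __init__(self, feat_dim, rng):
        self.proj = make_linear(feat_dim, HIDDEN_DIM, rng)

    def __call__(self, frames):
        return [apply_linear(self.proj, frame) for frame in frames]


class SharedDecoder:
    """Single decoder for all tasks: hidden vectors to acoustic features."""

    def __init__(self, rng):
        self.out = make_linear(HIDDEN_DIM, MEL_DIM, rng)

    def __call__(self, hidden_seq):
        return [apply_linear(self.out, h) for h in hidden_seq]


class MultiSourceModel:
    """Selects an encoder by task name, then runs the shared decoder."""

    def __init__(self, encoders, decoder):
        self.encoders = encoders
        self.decoder = decoder

    def generate(self, task, inputs):
        hidden = self.encoders[task](inputs)
        return self.decoder(hidden)


rng = random.Random(0)
model = MultiSourceModel(
    encoders={
        "tts": TextEncoder(vocab_size=30, rng=rng),
        "vc": AcousticEncoder(feat_dim=MEL_DIM, rng=rng),
    },
    decoder=SharedDecoder(rng),
)

# TTS path: three character ids in, three acoustic frames out.
tts_out = model.generate("tts", [3, 7, 1])
# VC path: two source acoustic frames in, two acoustic frames out.
vc_out = model.generate("vc", [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1]])
```

In this sketch only the encoders differ between tasks, so adding a further task such as SE would amount to registering one more encoder under a new key while the decoder stays unchanged.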