Abstract: Many speech technologies contain a speech generation stage, such as text-to-speech (TTS), voice conversion (VC), and speech enhancement (SE). Recent advances in deep-learning-based methods have significantly improved the performance of these technologies [1, 2, 3, 4, 5, 6, 7, 8]. Although various successful deep-learning-based speech processing methods have been proposed, most systems can perform only one task. For each problem, the network architecture is designed for the targeted task only and requires a long period of tuning specific to that problem. This procedure must be repeated for each new task, which restricts the potential of neural networks. The question is whether we can create a unified deep learning model that solves tasks across multiple speech technologies. We observe that the theoretical differences between these technologies are becoming much smaller than their original narrow definitions suggest. To give a few examples, recent high-performance VC systems benefit from using the phone posteriorgram (that is, a continuous phone representation) of the input speech [9]. There has also been an attempt to use both spectral features and the phone posteriorgram to further improve the performance of voice conversion [4]. Similar trends can be seen in TTS. End-to-end TTS systems sometimes use phone-embedding vectors as the input instead of letters [3, 10]. There has also been an attempt to give Tacotron a reference audio signal as an additional input, so that the prosody of the reference audio is transferred into the synthetic speech via a reference encoder [11]. Given these trends, we strongly believe that a single model can be shared across multiple tasks. We assume that speech-generation-related tasks can be divided into two parts: an input encoder and an acoustic decoder. The difference among the tasks lies in the input. For example, the input of TTS is text characters, while that of VC and SE is acoustic features. The model can be thought of as an encoder-decoder model that supports multiple encoders. The role of the multiple encoder networks is the front-end processing of each type of input data, and the role of the decoder network is to predict the acoustic features required for waveform generation. Our initial work starts with a joint training model for TTS and VC [12].
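The encoder-decoder split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the dimensions, and the random linear maps standing in for trained networks are all assumptions made for illustration. It only shows the structural idea that each input type has its own encoder into a common latent space, while one decoder predicts acoustic features for every task.

```python
import random

LATENT_DIM = 4    # size of the shared latent space (hypothetical value)
ACOUSTIC_DIM = 3  # size of each predicted acoustic frame (hypothetical value)


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vector, matrix):
    """Multiply a row vector of length `rows` by a rows x cols matrix."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(cols)]


class TextEncoder:
    """Front-end for TTS: maps each character to a latent vector."""

    def __init__(self, vocab, rng):
        self.table = {ch: [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
                      for ch in vocab}

    def __call__(self, text):
        return [self.table[ch] for ch in text]


class SpeechEncoder:
    """Front-end for VC: maps each acoustic frame to a latent vector."""

    def __init__(self, feature_dim, rng):
        self.weights = make_matrix(feature_dim, LATENT_DIM, rng)

    def __call__(self, frames):
        return [project(frame, self.weights) for frame in frames]


class AcousticDecoder:
    """Shared back-end: predicts one acoustic frame per latent vector."""

    def __init__(self, rng):
        self.weights = make_matrix(LATENT_DIM, ACOUSTIC_DIM, rng)

    def __call__(self, latents):
        return [project(z, self.weights) for z in latents]


class SharedModel:
    """One decoder shared by several task-specific encoders."""

    def __init__(self, encoders, decoder):
        self.encoders = encoders
        self.decoder = decoder

    def generate(self, task, inputs):
        # Only the encoder depends on the task; the decoder is always the same.
        return self.decoder(self.encoders[task](inputs))


rng = random.Random(0)
model = SharedModel(
    encoders={"tts": TextEncoder("abc ", rng),
              "vc": SpeechEncoder(feature_dim=5, rng=rng)},
    decoder=AcousticDecoder(rng),
)
tts_out = model.generate("tts", "a cab")                   # text characters in
vc_out = model.generate("vc", [[0.1] * 5, [0.2] * 5])      # acoustic frames in
```

In this sketch, supporting a further input type would mean registering one more encoder in the `encoders` dictionary; the decoder and the waveform-generation stage that consumes its output would stay unchanged.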

Understanding Shared Speech-Text Representations

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

MAESTRO: Matched Speech Text Representations Through Modality Matching

Learning Shared Semantic Space for Speech-to-Text Translation

Shared model for multi-source speech generation tasks

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Improving Joint Speech-Text Representations Without Alignment

Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Optimizing Alignment of Speech and Language Latent Spaces for End-To-End Speech Recognition and Understanding

Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition

Joint Speech-Text Embeddings for Multitask Speech Processing

Unified Speech-Text Pre-training for Speech Translation and Recognition

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Speech Representation Analysis based on Inter- and Intra-Model Similarities

An End-to-End Speech Recognition System Based on Shared Encoder

An Analysis of Semantically-Aligned Speech-Text Embeddings

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study