Abstract: Many speech technologies contain a speech generation stage, such as text-to-speech (TTS), voice conversion (VC), and speech enhancement (SE). Recent advances in deep-learning-based methods have significantly improved the performance of these technologies [1, 2, 3, 4, 5, 6, 7, 8]. However, although various successful deep-learning-based speech processing methods have been proposed, most systems can perform only one task. For each problem, the network architecture is designed for the targeted task alone and requires a long period of problem-specific tuning. This procedure must be repeated for each new task, which restricts the power of neural networks. The question is whether we can create a unified deep learning model that solves tasks across multiple speech technologies. We observe that the theoretical differences between these technologies are becoming much smaller than their original narrow definitions suggest. For example, recent high-performance VC systems benefit from using the phone posteriorgram (that is, a continuous phone representation) of the input speech [9]. There has also been an attempt to use both spectral features and the phone posteriorgram to further improve voice conversion performance [4]. Similar trends appear in TTS. End-to-end TTS systems sometimes use phone-embedding vectors as input instead of letters [3, 10]. There has also been an attempt to give Tacotron a reference audio signal as an additional input, so that the prosody of the reference audio is transferred into the synthetic speech via a reference encoder [11]. Given these trends, we strongly believe that a single model can be shared across multiple tasks. We assume that speech generation tasks can be divided into two parts: an input encoder and an acoustic decoder. The tasks differ in their input: for example, the input of TTS is text characters, while that of VC and SE is acoustic features. The model can be thought of as an encoder-decoder model that supports multiple encoders. The role of the encoder networks is front-end processing of each type of input data, and the role of the decoder network is to predict the acoustic features required for waveform generation. Our initial work starts with a joint training model for TTS and VC [12].
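
The multiple-encoder, single-decoder layout described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names, the dimensions, the random linear layers, and the frame-by-frame decoder are all assumptions chosen to keep the example short and dependency-free. A real system would use trained networks and, for TTS, some mechanism to align text length with the number of output frames, which is omitted here.

```python
import random

HIDDEN_DIM = 4      # assumed size of the shared hidden representation
ACOUSTIC_DIM = 3    # assumed size of one acoustic frame (toy value)
VOCAB = "abcdefghijklmnopqrstuvwxyz "  # hypothetical character set


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    """Multiply a vector (length rows) by a rows x cols weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


class TextEncoder:
    """Front end for TTS: maps each character to a hidden vector."""

    def __init__(self, rng):
        self.embedding = make_matrix(len(VOCAB), HIDDEN_DIM, rng)

    def __call__(self, text):
        return [self.embedding[VOCAB.index(ch)] for ch in text.lower() if ch in VOCAB]


class SpeechEncoder:
    """Front end for VC/SE: maps each acoustic frame to a hidden vector."""

    def __init__(self, rng):
        self.weight = make_matrix(ACOUSTIC_DIM, HIDDEN_DIM, rng)

    def __call__(self, frames):
        return [linear(frame, self.weight) for frame in frames]


class SharedDecoder:
    """Single decoder: predicts one acoustic frame per hidden vector."""

    def __init__(self, rng):
        self.weight = make_matrix(HIDDEN_DIM, ACOUSTIC_DIM, rng)

    def __call__(self, hidden):
        return [linear(h, self.weight) for h in hidden]


class SharedModel:
    """Several task-specific encoders feeding one shared decoder."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.encoders = {"tts": TextEncoder(rng), "vc": SpeechEncoder(rng)}
        self.decoder = SharedDecoder(rng)

    def generate(self, task, inputs):
        hidden = self.encoders[task](inputs)  # task-specific front end
        return self.decoder(hidden)           # shared acoustic prediction


model = SharedModel()
tts_frames = model.generate("tts", "hi")
vc_frames = model.generate("vc", [[0.1, 0.2, 0.3]] * 5)
print(len(tts_frames), len(vc_frames))
```

Both calls pass through the same `SharedDecoder` instance, so in a trained version of this layout the decoder parameters would be learned jointly from all tasks while only the encoders remain task-specific.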

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

AudioVSR: Enhancing Video Speech Recognition with Audio Data

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

SVTS: Scalable Video-to-Speech Synthesis

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Visual-Aware Text-to-Speech

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

Tell What You Hear From What You See -- Video to Audio Generation Through Text

Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Audiovisual Speech Synthesis using Tacotron2

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Shared model for multi-source speech generation tasks

Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders