Abstract: Neural Transducers (e.g., RNN-T) have been widely used in automatic speech recognition (ASR) because they efficiently model monotonic alignments between input and output sequences and naturally support streaming inputs. Since monotonic alignment is also critical to text-to-speech (TTS) synthesis, and streaming TTS is an important application scenario, in this work we explore applying the Transducer to TTS and beyond. This is challenging for two reasons: it is difficult to trade off the emission probability (continuous mel-spectrogram prediction) against the transition probability (an ASR Transducer predicts a blank token to indicate a transition to the next input) when calculating the output probability lattice, and it is not easy to learn the alignment between text and speech through that lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the lattice to help alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) supports both TTS and ASR together for the first time, which brings several advantages, including fewer parameters and streaming synthesis and recognition in a single model. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than an attention-based autoregressive TTS model owing to its inherent monotonic alignment between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) benefits from jointly modeling TTS and ASR in a single network.
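
To make the "output probability lattice" concrete, the following is a minimal sketch of a generic Transducer forward recursion in log space, with an optional band around the lattice diagonal. It is not the Speech-T forward algorithm, whose details the abstract does not give. The function name `transducer_forward`, the array names `log_emit` and `log_trans`, and the `band` parameter are all hypothetical, and the sketch assumes emission and transition scores are already available as log-values kept in separate arrays.

```python
import numpy as np

def transducer_forward(log_emit, log_trans, band=None):
    """Sum over all monotonic paths through a T x (U+1) lattice, in log space.

    log_emit[t, u]  : log-score of producing output u+1 while at input position t
                      (shape T x U); hypothetical input, assumed precomputed.
    log_trans[t, u] : log-score of moving from input t to t+1 after u outputs
                      (shape T x (U+1)); plays the role of the blank token.
    band            : optional maximum distance from the normalized diagonal;
                      nodes outside the band are excluded (an assumed, simple
                      form of diagonal constraint).
    """
    T, U = log_emit.shape
    assert U >= 1 and log_trans.shape == (T, U + 1)
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0  # all paths start at the first input with no outputs
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Normalize so that (0, 0) and (T-1, U) both lie on the diagonal.
            if band is not None and abs(u / U - t / max(T - 1, 1)) > band:
                continue  # node stays at -inf, i.e. unreachable
            # Arrive by emitting an output at the same input position ...
            stay = alpha[t, u - 1] + log_emit[t, u - 1] if u > 0 else -np.inf
            # ... or by transitioning from the previous input position.
            move = alpha[t - 1, u] + log_trans[t - 1, u] if t > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, move)
    # A final transition out of the last input position ends the sequence.
    return alpha[T - 1, U] + log_trans[T - 1, U]
```

Keeping the transition scores in their own array mirrors, only loosely, the idea of separating transition prediction from mel-spectrogram prediction. The band discards alignments that stray far from the diagonal, which is one plausible reading of a diagonal constraint.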

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

A unified front-end framework for English text-to-speech synthesis

Unified Mandarin TTS Front-end Based on Distilled BERT Model

End-to-end Code-switched TTS with Mix of Monolingual Recordings

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation

An Unified and Automatic Approach of Mandarin HTS System

Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

A Novel Hybrid Mandarin Speech Synthesis System Using Different Base Units for Model Training and Concatenation

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

Scalable Multilingual Frontend for TTS

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

A Unified Framework for Multilingual Text-to-speech Synthesis with SSML Specification As Interface

Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding

Speech-T: Transducer for Text to Speech and Beyond

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers