Abstract: Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) because it efficiently models monotonic alignments between input and output sequences and naturally supports streaming inputs. Since monotonic alignments are also critical to text-to-speech (TTS) synthesis, and streaming TTS is an important application scenario, in this work we explore the possibility of applying Transducer to TTS and more. This is challenging for two reasons. First, when calculating the output probability lattice in Transducer, it is difficult to trade off the emission probability (continuous mel-spectrogram prediction) against the transition probability (an ASR Transducer predicts a blank token to indicate a transition to the next input). Second, it is not easy to learn the alignments between text and speech through the output probability lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help the alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) further supports both TTS and ASR together for the first time, which brings several advantages, including fewer parameters as well as streaming synthesis and recognition in a single model. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) benefits from jointly modeling TTS and ASR in a single network.
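
For readers unfamiliar with the output probability lattice mentioned above, the following is a minimal sketch of the standard forward recursion used by ASR Transducers such as RNN-T, where each lattice node either emits the next label or emits blank to move to the next input frame. It is not the new forward algorithm proposed for Speech-T, whose details are not given in the abstract. The function and argument names (`transducer_log_likelihood`, `log_blank`, `log_emit`) are hypothetical, and the per-node log-probabilities are assumed to be precomputed.

```python
import math

NEG_INF = float("-inf")


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) for two log-values."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_emit):
    """Standard (ASR-style) Transducer forward pass in log space.

    log_blank[t][u]: log-prob of blank at node (t, u), shape T x (U + 1).
                     Blank is the transition to the next input frame.
    log_emit[t][u]:  log-prob of emitting label u + 1 at node (t, u),
                     shape T x U.
    Returns the log-probability summed over all monotonic alignments.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0  # log(1): every alignment starts at node (0, 0)
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from (t - 1, u) via a blank (transition) step.
            from_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            # Arrive from (t, u - 1) via a label (emission) step.
            from_emit = alpha[t][u - 1] + log_emit[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logsumexp2(from_blank, from_emit)
    # A final blank at the last node terminates the sequence.
    return alpha[T - 1][U] + log_blank[T - 1][U]
```

In this recursion the blank and label probabilities share one categorical distribution at each node, which is straightforward for discrete ASR labels. With continuous mel-spectrogram outputs there is no such shared distribution, which is the trade-off between emission and transition probability that the abstract describes.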

Transformer-PSS: A High-Efficiency Prosodic Speech Synthesis Model based on Transformer

Neural Speech Synthesis with Transformer Network

FastSpeech: Fast, Robust and Controllable Text to Speech

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

A Transformer-based Chinese Non-autoregressive Speech Synthesis Scheme

Transformer-S2A: Robust and Efficient Speech-to-Animation

Close to Human Quality TTS with Transformer

SETransformer: Speech Enhancement Transformer

PSST! Prosodic Speech Segmentation with Transformers

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

Patnet: A Phoneme-Level Autoregressive Transformer Network for Speech Synthesis

Speech-T: Transducer for Text to Speech and Beyond

Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding

Sim-T: Simplify the Transformer Network by Multiplexing Technique for Speech Recognition

RobuTrans: A Robust Transformer-Based Text-to-Speech Model

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

DPATD: Dual-Phase Audio Transformer for Denoising

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System