Abstract: Neural Transducers (e.g., RNN-T) are widely used in automatic speech recognition (ASR) because they efficiently model monotonic alignments between input and output sequences and naturally support streaming inputs. Because monotonic alignments are also critical to text-to-speech (TTS) synthesis, and streaming TTS is an important application scenario, this work explores applying the Transducer to TTS and beyond. This is challenging for two reasons: it is difficult to trade off the emission probability (continuous mel-spectrogram prediction) against the transition probability (an ASR Transducer predicts a blank token to indicate the transition to the next input) when calculating the output probability lattice in the Transducer, and it is not easy to learn the alignments between text and speech through that lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help the alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) supports both TTS and ASR in a single model for the first time, which brings several advantages, including fewer parameters as well as streaming synthesis and recognition. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) benefits from jointly modeling TTS and ASR in a single network.
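
For readers unfamiliar with the output probability lattice mentioned in the abstract, the following is a minimal sketch of the standard Transducer (RNN-T style) forward recursion in log space, with an optional diagonal band mask. It is not the Speech-T forward algorithm, whose details the abstract does not give. The function names, the array layout, and the `width` band parameter are assumptions made purely for illustration.

```python
import numpy as np


def diagonal_mask(T, U, width):
    """Boolean (T, U + 1) mask keeping lattice nodes near the main diagonal.

    `width` is a hypothetical band half-width, measured on normalized
    coordinates in [0, 1]; it is an illustrative choice, not the paper's.
    """
    t = np.arange(T)[:, None] / max(T - 1, 1)
    u = np.arange(U + 1)[None, :] / max(U, 1)
    return np.abs(t - u) <= width


def transducer_log_likelihood(log_blank, log_emit, mask=None):
    """Standard Transducer forward pass over a T x (U + 1) lattice.

    log_blank[t, u]: log-prob of the transition t -> t + 1 at node (t, u),
                     shape (T, U + 1).
    log_emit[t, u]:  log-prob of emitting output u + 1 at node (t, u),
                     shape (T, U).
    mask:            optional boolean (T, U + 1) array; nodes set to False
                     are treated as unreachable.
    """
    T, U1 = log_blank.shape
    U = U1 - 1
    if mask is None:
        mask = np.ones((T, U1), dtype=bool)

    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0  # log(1): every path starts at the origin
    for t in range(T):
        for u in range(U1):
            if (t == 0 and u == 0) or not mask[t, u]:
                continue
            # Arrive by a transition from (t - 1, u) or an emission from (t, u - 1).
            via_blank = alpha[t - 1, u] + log_blank[t - 1, u] if t > 0 else -np.inf
            via_emit = alpha[t, u - 1] + log_emit[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(via_blank, via_emit)

    # A final transition out of the last node terminates the sequence.
    return alpha[T - 1, U] + log_blank[T - 1, U]
```

Under these assumptions, passing `mask=diagonal_mask(T, U, width)` simply removes off-diagonal paths from the sum, which is one plausible reading of a diagonal constraint on the lattice.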

Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition

Synchronous Transformers for End-to-End Speech Recognition

The Speechtransformer for Large-scale Mandarin Chinese Speech Recognition

Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition

Research Status and Prospect of Transformer in Speech Recognition

Attention Is All You Need

Efficient Training of Neural Transducer for Speech Recognition

TST: Time-Sparse Transducer for Automatic Speech Recognition

SETransformer: Speech Enhancement Transformer

Self-Attention Transducers for End-to-End Speech Recognition

End-to-End Multi-speaker Speech Recognition with Transformer

Attention-based Transducer for Online Speech Recognition

One in A Hundred: Selecting the Best Predicted Sequence from Numerous Candidates for Speech Recognition

R-Transformer: Recurrent Neural Network Enhanced Transformer

Speech-T: Transducer for Text to Speech and Beyond

Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding

Efficient Long Sequence Modeling Via State Space Augmented Transformer

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Transformer with Bidirectional Decoder for Speech Recognition

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition