Abstract: The neural Transducer (e.g., RNN-T) is widely used in automatic speech recognition (ASR) because it efficiently models monotonic alignments between input and output sequences and naturally supports streaming inputs. Since monotonic alignments are also critical to text-to-speech (TTS) synthesis, and streaming TTS is an important application scenario, in this work we explore applying the Transducer to TTS and beyond. This is challenging for two reasons: it is difficult to trade off the emission probability (continuous mel-spectrogram prediction) against the transition probability (an ASR Transducer predicts a blank token to indicate a transition to the next input) when calculating the output probability lattice, and it is not easy to learn the alignments between text and speech through that lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) supports both TTS and ASR together for the first time, which brings several advantages, including fewer parameters and streaming synthesis and recognition in a single model. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) benefits from jointly modeling TTS and ASR in a single network.
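For readers unfamiliar with the output probability lattice mentioned above, the following is a minimal sketch of the standard forward recursion used by ASR Transducers, in which a blank token advances to the next input frame and a label advances to the next output token. It is not Speech-T's modified forward algorithm, which the abstract describes only at a high level; the function name `transducer_log_likelihood` and the table layout (`log_blank[t][u]`, `log_label[t][u]`) are assumptions chosen for illustration.

```python
import math

NEG_INF = -math.inf


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_label):
    """Standard ASR Transducer forward pass over a T x (U+1) lattice.

    log_blank[t][u]: log-probability of emitting blank at node (t, u),
                     which moves to the next input frame, (t+1, u).
    log_label[t][u]: log-probability of emitting target label u+1 at
                     node (t, u), which moves to (t, u+1).
    Shapes (illustrative assumption): log_blank is T x (U+1),
    log_label is T x U.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0  # log(1): every alignment starts at node (0, 0)
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from the previous frame via a blank (transition).
            via_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            # Arrive from the previous label position via an emission.
            via_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(via_blank, via_label)
    # A final blank at the last node terminates every alignment.
    return alpha[T - 1][U] + log_blank[T - 1][U]


if __name__ == "__main__":
    half = math.log(0.5)
    # Toy lattice: T = 2 input frames, U = 1 target label.
    blank = [[half, half], [half, half]]
    label = [[half], [half]]
    print(math.exp(transducer_log_likelihood(blank, label)))
```

In this recursion the emission and transition scores compete inside one distribution at each lattice node. The abstract identifies that coupling as hard to balance when the emission is a continuous mel-spectrogram frame rather than a discrete token.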

Label-Synchronous Neural Transducer for End-to-End ASR

Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation

Speech-T: Transducer for Text to Speech and Beyond

LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Decoupled Structure for Improved Adaptability of End-to-End Models

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Attention-based Transducer for Online Speech Recognition

TST: Time-Sparse Transducer for Automatic Speech Recognition

Multi-blank Transducers for Speech Recognition

Transformer-Transducers for Code-Switched Speech Recognition

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Accelerating Transducers through Adjacent Token Merging

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

A Lexical-aware Non-autoregressive Transformer-based ASR Model

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Improved Factorized Neural Transducer Model For text-only Domain Adaptation

Alignment Restricted Streaming Recurrent Neural Network Transducer

Self-Attention Transducers for End-to-End Speech Recognition