Abstract: End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data for training. Further strengthening the language modeling ability through extra text data, for example by shallow fusion with an external language model, brings only a small performance gain. Because Mandarin Chinese is a character-based language in which each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
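
The two-stage structure described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy lexicon `TOY_LEXICON`, the helper names `text_to_training_pair` and `cascade_decode`, and the stub functions standing in for the two RNN-T models. The sketch shows only why text-only data can feed the second stage: if each character maps to a tonal syllable, a (syllable sequence, character sequence) training pair can be derived from text without any audio. Polyphonic characters, which a real system would have to handle, are ignored here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy lexicon: character -> tonal syllable (pinyin plus tone digit).
# A real system would use a full pronunciation lexicon and handle polyphones.
TOY_LEXICON: Dict[str, str] = {
    "你": "ni3",
    "好": "hao3",
    "中": "zhong1",
    "国": "guo2",
}


def text_to_training_pair(
    text: str, lexicon: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Derive one (syllables, characters) pair from text alone, no audio needed."""
    chars = [c for c in text if c in lexicon]  # skip characters not in the lexicon
    syllables = [lexicon[c] for c in chars]
    return syllables, chars


def cascade_decode(
    features: Sequence[float],
    acoustic_to_syllable: Callable[[Sequence[float]], List[str]],
    syllable_to_character: Callable[[List[str]], List[str]],
) -> str:
    """Run stage 1 (acoustics -> syllables), then stage 2 (syllables -> characters)."""
    syllables = acoustic_to_syllable(features)
    return "".join(syllable_to_character(syllables))


# Stubs standing in for the two trained models (assumptions, not real RNN-Ts).
def stub_acoustic_model(features: Sequence[float]) -> List[str]:
    return ["ni3", "hao3"]


_INVERSE = {syl: ch for ch, syl in TOY_LEXICON.items()}


def stub_converter(syllables: List[str]) -> List[str]:
    return [_INVERSE[s] for s in syllables]


pair = text_to_training_pair("你好", TOY_LEXICON)
decoded = cascade_decode([0.0, 0.1, 0.2], stub_acoustic_model, stub_converter)
```

In this sketch the converter is a dictionary lookup; in the approach the abstract describes, that role is played by an RNN-T-based syllable-to-character converter, which can use context to choose among characters that share a syllable.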

Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter