Abstract: Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, deploying hybrid CTC/attention systems for online speech recognition remains a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of the conventional CTC/attention ASR architecture with corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve its training-and-decoding mismatch problem. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation. Thirdly, we design the dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real-time factor in human-computer interaction services with only moderate degradation in recognition performance. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture.

CRF-based Single-stage Acoustic Modeling with CTC Topology

Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning

CAT: CRF-based ASR Toolkit

CR-CTC: Consistency regularization on CTC for improved speech recognition

Residual Convolutional CTC Networks for Automatic Speech Recognition

Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Advancing Acoustic-to-Word CTC Model

Comparison of Decoding Strategies for CTC Acoustic Models

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

Attention-Based Gated Scaling Adaptive Acoustic Model for CTC-Based Speech Recognition

Exploiting Prosodic and Lexical Features for Tone Modeling in a Conditional Random Field Framework

CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition

CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

Cross-modal Alignment with Optimal Transport for CTC-based ASR

Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

CACnet: Cube Attentional CNN for Automatic Speech Recognition

Enhancing CTC-based speech recognition with diverse modeling units

CRF-based confidence measures of recognized candidates for lattice-based audio indexing

CAT: A CTC-CRF Based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches Towards Data Efficiency and Low Latency

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture