Abstract: End-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments, have made increasing progress in recent years. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, deploying hybrid CTC/attention systems for online speech recognition remains a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces every offline component of the conventional CTC/attention ASR architecture with a corresponding streaming component. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve its training-and-decoding mismatch problem. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation. Thirdly, we design a dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real-time factor in human-computer interaction services while incurring only moderate performance degradation. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture.
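For readers unfamiliar with hybrid CTC/attention decoding, the following is a minimal sketch of the commonly used score interpolation, in which a hypothesis is ranked by a weighted sum of its CTC and attention log-probabilities. It is not the DWJD algorithm or the T-CTC prefix score described in the abstract; the function names, the `ctc_weight` parameter, and the toy log-probabilities are assumptions made purely for illustration. A small helper for the real-time factor (processing time divided by audio duration) is included as well.

```python
from typing import List, Tuple

# (tokens, CTC log-probability, attention log-probability); hypothetical layout.
Hypothesis = Tuple[str, float, float]


def joint_score(log_p_ctc: float, log_p_att: float, ctc_weight: float = 0.5) -> float:
    """Interpolate the CTC and attention log-probabilities of one hypothesis."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def rerank(hypotheses: List[Hypothesis], ctc_weight: float = 0.5) -> List[Hypothesis]:
    """Sort hypotheses by joint score, best (highest) first."""
    return sorted(
        hypotheses,
        key=lambda h: joint_score(h[1], h[2], ctc_weight),
        reverse=True,
    )


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0.0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Toy values, not taken from any experiment.
    candidates = [("hypothesis a", -1.0, -3.0), ("hypothesis b", -2.0, -1.0)]
    print(rerank(candidates, ctc_weight=0.5)[0][0])
    print(real_time_factor(processing_seconds=2.0, audio_seconds=10.0))
```

With equal weighting the second toy hypothesis ranks first because its attention score dominates; setting `ctc_weight` to 1.0 ranks purely by the CTC score. An online system such as the one described above must additionally compute both scores from partial input, which this sketch does not attempt.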

Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction

Monotonic Gaussian regularization of attention for robust automatic speech recognition

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Optimizing Alignment of Speech and Language Latent Spaces for End-To-End Speech Recognition and Understanding

An improved hybrid CTC-Attention model for speech recognition

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

An Online Attention-based Model for Speech Recognition

EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

A Neural Time Alignment Module for End-to-End Automatic Speech Recognition

NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR

Structured Sparse Attention for End-to-end Automatic Speech Recognition

One TTS Alignment To Rule Them All

Enhancing CTC-based speech recognition with diverse modeling units

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Efficient Monotonic Multihead Attention

Improving Joint Speech-Text Representations Without Alignment

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Enhancing Monotonicity for Robust Autoregressive Transformer TTS