Abstract: Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, deploying hybrid CTC/attention systems for online speech recognition remains a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces every offline component of the conventional CTC/attention ASR architecture with a corresponding streaming component. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve its training-and-decoding mismatch problem. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation. Thirdly, we design a dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real-time factor in human-computer interaction services while incurring only moderate degradation in recognition performance. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture.
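
The abstract does not spell out how the CTC and attention predictions are combined during joint decoding. A common formulation in hybrid CTC/attention systems is a log-linear interpolation of the two hypothesis scores, and the sketch below illustrates only that general idea. The function names, the `ctc_weight` parameter, and the toy probabilities are all assumptions made for illustration; this is not the DWJD algorithm itself.

```python
import math


def joint_score(log_p_ctc: float, log_p_att: float, ctc_weight: float = 0.5) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    ctc_weight is a hypothetical interpolation weight in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def rank_hypotheses(hyps, ctc_weight: float = 0.5):
    """Sort (text, log_p_ctc, log_p_att) tuples by joint score, best first."""
    return sorted(
        hyps,
        key=lambda h: joint_score(h[1], h[2], ctc_weight),
        reverse=True,
    )


# Toy hypotheses with made-up probabilities (illustrative only).
hyps = [
    ("a cat", math.log(0.6), math.log(0.2)),
    ("a cap", math.log(0.3), math.log(0.5)),
]

for text, log_p_ctc, log_p_att in rank_hypotheses(hyps, ctc_weight=0.5):
    print(text, round(joint_score(log_p_ctc, log_p_att, 0.5), 3))
```

With equal weighting the second hypothesis ranks first because the product of its two probabilities is larger; setting `ctc_weight=1.0` ranks by the CTC score alone and reverses the order.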

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture