Abstract: Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, a variant of AED, generate tokens autoregressively and are therefore relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, the word embeddings of the autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAEs) that are extracted from encoder outputs using the acoustic boundary information offered by the CTC alignment. TAEs can be obtained in parallel, which allows the output tokens to be generated in parallel. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER). During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT achieves a WER close to that of AT on various ASR tasks while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs function similarly to word embeddings with respect to grammatical structures, which might indicate that some semantic information can be learned from TAEs without a language model.
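
To make the idea of token-level acoustic embeddings concrete, the following is a minimal sketch of how a frame-level CTC alignment could be turned into one embedding per output token. It is an illustration under stated assumptions, not the extractor used in CASS-NAT: the abstract does not specify how encoder frames are combined, so plain mean pooling over each token's non-blank frames is assumed here. The function names `ctc_token_segments` and `token_acoustic_embeddings`, the blank index `0`, and the toy inputs are all hypothetical.

```python
from itertools import groupby


def ctc_token_segments(alignment, blank=0):
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    `end` is exclusive. Runs of the same non-blank label form one token;
    blank frames are skipped (an assumption made for this sketch).
    """
    segments = []
    frame = 0
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != blank:
            segments.append((label, frame, frame + length))
        frame += length
    return segments


def token_acoustic_embeddings(encoder_out, alignment, blank=0):
    """Mean-pool the encoder frames inside each token span.

    Mean pooling is an assumed stand-in for the real extractor. Each span
    is independent of the others, so the pooling could run in parallel;
    the loop below is sequential only for readability.
    """
    if len(encoder_out) != len(alignment):
        raise ValueError("encoder_out and alignment must have the same length")
    embeddings = []
    for _, start, end in ctc_token_segments(alignment, blank):
        frames = encoder_out[start:end]
        dim = len(frames[0])
        embeddings.append(
            [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        )
    return embeddings


# Toy example: 8 encoder frames of dimension 2, and an alignment in which
# token 3 appears twice (separated by a blank) followed by token 5.
alignment = [0, 3, 3, 0, 3, 5, 5, 0]
encoder_out = [[0, 0], [1, 2], [3, 4], [9, 9], [5, 6], [2, 0], [4, 2], [7, 7]]

segments = ctc_token_segments(alignment)
tae = token_acoustic_embeddings(encoder_out, alignment)
```

Because the number of spans equals the number of output tokens, the decoder receives one input vector per token without waiting for previously generated tokens, which is what permits single-step parallel decoding.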

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

End-to-End Speech Recognition with Pre-trained Masked Language Model

EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Modular End-to-End Automatic Speech Recognition Framework for Acoustic-to-Word Model

A Neural Time Alignment Module for End-to-End Automatic Speech Recognition

WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Recent Advances in End-to-End Automatic Speech Recognition

Alignment-Free Training for Transducer-based Multi-Talker ASR

End-to-End Joint Target and Non-Target Speakers ASR

Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

Enhancing CTC-based speech recognition with diverse modeling units