Abstract: Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, a variant of AED, generate tokens autoregressively and are therefore relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, the word embeddings of the autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAE), which are extracted from the encoder outputs using the acoustic boundary information provided by the CTC alignment. TAEs can be obtained in parallel, so the output tokens can also be generated in parallel. During training, the Viterbi alignment is used for TAE generation, and multiple training strategies are explored to further improve the word error rate (WER). During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT achieves a WER close to that of AT on various ASR tasks while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs function similarly to word embeddings with respect to grammatical structures, which might indicate that some semantic information can be learned from TAEs without a language model.
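The idea of using a CTC alignment as token boundary information can be illustrated with a minimal sketch. It is not the extraction method of the paper: the abstract only states that TAEs are extracted from encoder outputs using boundaries from the CTC alignment, so the boundary rule below (a token's segment runs from the end of the previous token to its own last frame, with trailing blanks attached to the final token) and the mean pooling that stands in for the TAE extractor are both assumptions made for illustration. The names `ctc_segments`, `mean_pool`, and `BLANK` are hypothetical.

```python
from typing import List, Tuple

BLANK = 0  # assumed id of the CTC blank label


def ctc_segments(alignment: List[int], blank: int = BLANK) -> List[Tuple[int, int, int]]:
    """Turn a frame-level CTC alignment into (token, start, end) segments.

    `end` is exclusive. Repeated labels are merged and blanks are dropped,
    following the usual CTC collapsing rule. Leading blanks are attached to
    the token that follows them (an assumption of this sketch).
    """
    segments: List[Tuple[int, int, int]] = []
    seg_start = 0
    prev = blank
    for t, label in enumerate(alignment):
        # A token ends as soon as the label moves away from a non-blank label.
        if prev != blank and label != prev:
            segments.append((prev, seg_start, t))
            seg_start = t
        prev = label
    if prev != blank:
        # The alignment ends inside a token.
        segments.append((prev, seg_start, len(alignment)))
    elif segments:
        # Trailing blanks are attached to the last token (assumption).
        token, start, _ = segments[-1]
        segments[-1] = (token, start, len(alignment))
    return segments


def mean_pool(encoder_out: List[List[float]],
              segments: List[Tuple[int, int, int]]) -> List[List[float]]:
    """Average encoder frames inside each segment: one vector per token.

    Mean pooling is a simplified stand-in for a learned TAE extractor.
    Each segment is independent of the others, so all token-level vectors
    could be computed in parallel.
    """
    pooled = []
    for _, start, end in segments:
        frames = encoder_out[start:end]
        dim = len(frames[0])
        pooled.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return pooled


# Toy example: 10 frames, 2-dimensional encoder outputs, 3 output tokens.
alignment = [0, 0, 3, 3, 0, 5, 0, 7, 7, 0]
encoder_out = [[float(t), 1.0] for t in range(len(alignment))]
segments = ctc_segments(alignment)
token_embeddings = mean_pool(encoder_out, segments)
```

Here the number of segments equals the number of output tokens, so the decoder input length is fixed by the alignment before any token is predicted; this is what allows single-step parallel generation.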

End-to-End ASR with Adaptive Span Self-Attention

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Simplified Self-Attention for Transformer-Based End-to-End Speech Recognition

Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Adaptive Multi-Resolution Attention with Linear Complexity

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention

LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition

A Window Attention Based Transformer for Automatic Speech Recognition

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition

Multi Resolution Analysis (MRA) for Approximate Self-Attention

Non-autoregressive Transformer-based End-to-end ASR using BERT

Self-Attention Transducers for End-to-End Speech Recognition

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning