Abstract: The transformer is a fundamental building block in deep learning, and the attention mechanism is its core component. Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture. Because transformers are acausal by default, their use in SSRL has focused predominantly on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The proposed attention module includes two components: streaming attention (SA) and low-latency streaming attention (LLSA). The SA is our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as masked acausal attention (MAA), guaranteeing a latency equal to that of a single layer even when multiple layers are stacked. We present a comparative analysis of vanilla attention, which we refer to here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL model with automatic speech recognition as the downstream task. When trained on librispeech-clean-100 and tested on librispeech-test-clean, our low-latency attention module achieves a word error rate (WER) of 5.84%, a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers while providing latency characteristics that make it suitable for real-time streaming applications.
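The latency build-up mentioned above can be illustrated with a small sketch. It is not the authors' SA or LLSA implementation; it only shows how the look-ahead of a banded (masked) attention layer accumulates when layers are stacked. The function names, the 8-frame look-ahead, the 12-layer depth, and the 20 ms frame hop are all assumptions, chosen so that the resulting figures line up with the 0.16 s and 1.92 s quoted in the abstract, which does not state these values.

```python
import numpy as np


def banded_mask(num_frames: int, left: int, right: int) -> np.ndarray:
    """mask[i, j] is True when output frame i may attend to input frame j."""
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # j - i
    return (offset >= -left) & (offset <= right)


def effective_lookahead(mask: np.ndarray, num_layers: int) -> int:
    """Largest j - i such that output i of the stacked layers depends on input j."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(num_layers):
        # Output i of this layer depends on input k if it attends to some
        # frame j whose own value already depends on k.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    rows, cols = np.nonzero(reach)
    return int((cols - rows).max())


# Hypothetical settings (assumptions, not taken from the paper).
FRAME_HOP_S = 0.02   # assumed 20 ms between frames
LOOKAHEAD = 8        # assumed future frames visible to each layer
NUM_LAYERS = 12      # assumed encoder depth

mask = banded_mask(num_frames=128, left=16, right=LOOKAHEAD)
one_layer = effective_lookahead(mask, 1)
stacked = effective_lookahead(mask, NUM_LAYERS)
print(f"one layer: {one_layer} frames, about {one_layer * FRAME_HOP_S:.2f} s")
print(f"{NUM_LAYERS} layers: {stacked} frames, about {stacked * FRAME_HOP_S:.2f} s")
```

Under these assumptions the per-layer look-ahead of 8 frames grows to 96 frames across the stack, which is the kind of growth the LLSA is described as avoiding by keeping the total latency at that of a single layer.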

A low latency attention module for streaming self-supervised speech representation learning
