Abstract: The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture. Because transformers are acausal, their use for SSRL has focused predominantly on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The proposed attention module includes two components: streaming attention (SA) and low-latency streaming attention (LLSA). The SA is our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as the masked acausal attention (MAA), guaranteeing a latency equal to that of one layer even when multiple layers are stacked. We present a comparative analysis of the vanilla attention, which we refer to here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL model with automatic speech recognition as the downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module achieves a word error rate (WER) of 5.84%, a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers while providing latency characteristics that make it applicable to real-time streaming applications.
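The latency build-up mentioned in the abstract can be illustrated with a small sketch. This is not the paper's SA or LLSA implementation; it is a generic banded (limited-lookahead) attention mask in NumPy, and the names `banded_mask`, `masked_attention`, and `stacked_lookahead`, as well as the frame counts, are hypothetical choices made for illustration. It shows that when every layer may look a fixed number of frames into the future, the lookahead of the stacked network grows linearly with the number of layers.

```python
import numpy as np

def banded_mask(num_frames, left_context, lookahead):
    """Boolean mask: entry [t, s] is True when query frame t may attend to key frame s."""
    t = np.arange(num_frames)[:, None]
    s = np.arange(num_frames)[None, :]
    return (s >= t - left_context) & (s <= t + lookahead)

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # blocked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def stacked_lookahead(mask, num_layers):
    """Largest number of future frames that can influence any output frame
    after num_layers attention layers that all use the same mask."""
    step = mask.astype(int)
    reach = np.eye(mask.shape[0], dtype=int)
    for _ in range(num_layers):
        reach = ((step @ reach) > 0).astype(int)  # compose one more layer of reachability
    frames = np.arange(mask.shape[0])
    farthest = np.array([np.nonzero(row)[0].max() for row in reach])
    return int((farthest - frames).max())

# Hypothetical setting: 64 frames, 8 frames of left context, 2 frames of lookahead.
mask = banded_mask(num_frames=64, left_context=8, lookahead=2)
one_layer = stacked_lookahead(mask, num_layers=1)
four_layers = stacked_lookahead(mask, num_layers=4)
```

Under these assumptions, `one_layer` is 2 frames and `four_layers` is 8 frames, that is, the per-layer lookahead multiplied by the depth. The abstract's figures are consistent with this kind of scaling (1.92 s is 12 times 0.16 s), although the number of layers is not stated in the text above and is only inferred here.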

A low latency attention module for streaming self-supervised speech representation learning
