Abstract: The transformer is a fundamental building block in deep learning, and the attention mechanism is its core component. Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture. Because transformers are acausal by design, their use for SSRL has focused predominantly on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The proposed attention module has two components: streaming attention (SA) and low-latency streaming attention (LLSA). The SA is our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as masked acausal attention (MAA), guaranteeing a latency equal to that of a single layer even when multiple layers are stacked. We present a comparative analysis of vanilla attention, which we refer to here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL model with automatic speech recognition as the downstream task. When trained on librispeech-clean-100 and tested on librispeech-test-clean, our low-latency attention module achieves a word error rate (WER) of 5.84%, a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, while offering latency characteristics that make it suitable for real-time streaming applications.
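The latency build-up that the abstract attributes to masked attention can be illustrated with a short sketch. This is not the authors' SA or LLSA implementation, whose details the abstract does not give. It is a generic banded-mask attention in NumPy, and the names `banded_mask`, `masked_attention`, and `stacked_lookahead`, as well as the `left` and `lookahead` parameters, are hypothetical choices made for illustration. The sketch shows that when every layer is allowed to look a fixed number of frames ahead, the effective lookahead of a stack grows linearly with the number of layers.

```python
import numpy as np


def banded_mask(num_frames, left, lookahead):
    """Boolean mask: query frame i may attend to key frames i-left .. i+lookahead."""
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left) & (offset <= lookahead)


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed positions
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def stacked_lookahead(mask, num_layers):
    """Frames of future input that one output frame depends on after
    stacking num_layers layers that all share the same mask."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(num_layers):
        # Compose dependencies: layer-l output -> layer-(l-1) frames it attends to.
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    mid = mask.shape[0] // 2  # probe a frame far from the sequence edges
    return int(np.flatnonzero(reach[mid]).max() - mid)
```

With `mask = banded_mask(64, left=4, lookahead=2)`, `stacked_lookahead(mask, 1)` gives 2 frames and `stacked_lookahead(mask, 12)` gives 24 frames. Under the assumption (not stated in the abstract) of a 12-layer stack in which each layer looks ahead 0.16 s, the same linear growth would yield 12 × 0.16 s = 1.92 s, which matches the two latency figures the abstract reports.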

Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement

Lightweight Multi-Axial Transformer with Frequency Prompt for Single Channel Speech Enhancement

Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement

Enhancing Local Dependencies for Transformer-Based Text-to-Speech via Hybrid Lightweight Convolution

SETransformer: Speech Enhancement Transformer

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

A low latency attention module for streaming self-supervised speech representation learning

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

LRTD: A Low-rank Transformer with Dynamic Depth and Width for Speech Recognition

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

TFCN: Temporal-Frequential Convolutional Network for Single-Channel Speech Enhancement

CAT: Causal Audio Transformer for Audio Classification

Explore Relative and Context Information with Transformer for Joint Acoustic Echo Cancellation and Speech Enhancement

DCHT: Deep Complex Hybrid Transformer for Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Neural Speech Synthesis with Transformer Network