Abstract: Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on partial recognition results, a technique we refer to as prefetching. Specifically, prefetching works by submitting partial recognition results for subsequent processing, such as obtaining assistant server responses or second-pass rescoring, before the recognition result is finalized. If the partial result matches the final recognition result, the early-fetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that typically occurs after recognition is completed. Prefetching can be triggered multiple times for a single query, but this leads to multiple rounds of downstream processing and increases the computation cost. It is hence desirable to fetch the result sooner while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigate a series of prefetching decision models, including decoder-silence-based prefetching, acoustic-silence-based prefetching, and end-to-end prefetching. We demonstrate that the proposed prefetching mechanism reduces latency by ∼200 ms for a system that consists of a streaming first-pass model using a recurrent neural network transducer and a non-streaming second-pass rescoring model using Listen, Attend and Spell. We observe that end-to-end prefetching provides the best trade-off between cost and latency and is 120 ms faster than silence-based prefetching at a fixed prefetch rate.
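The prefetching mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Partial` record, the `run_query` function, the boolean `trigger` flag (standing in for whichever prefetching decision model is used), and all timing values are hypothetical and chosen only for illustration. The sketch assumes a fixed downstream fetch latency and reports when the response becomes available and how many downstream fetches were issued.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Partial:
    time_ms: int   # time at which the recognizer emitted this partial result
    text: str      # partial recognition result at that time
    trigger: bool  # hypothetical prefetch decision (e.g. silence detected)


def run_query(partials: List[Partial], final_time_ms: int, final_text: str,
              fetch_latency_ms: int) -> Tuple[int, int]:
    """Return (time the response is ready in ms, number of downstream fetches)."""
    prefetched_text: Optional[str] = None
    prefetched_ready_ms = 0
    fetches = 0
    for p in partials:
        # Prefetch only when the decision fires and the hypothesis has changed.
        if p.trigger and p.text != prefetched_text:
            prefetched_text = p.text
            prefetched_ready_ms = p.time_ms + fetch_latency_ms
            fetches += 1
    if prefetched_text == final_text:
        # Match: reuse the early response once recognition is finalized.
        return max(final_time_ms, prefetched_ready_ms), fetches
    # Mismatch (or no prefetch): fetch again after finalization.
    return final_time_ms + fetch_latency_ms, fetches + 1


# Made-up example: the second partial already equals the final result.
partials = [
    Partial(time_ms=400, text="what is", trigger=False),
    Partial(time_ms=900, text="what is the weather", trigger=True),
]
ready_ms, n_fetches = run_query(partials, final_time_ms=1000,
                                final_text="what is the weather",
                                fetch_latency_ms=300)
baseline_ms = 1000 + 300  # fetch only after the result is finalized
print(ready_ms, n_fetches, baseline_ms - ready_ms)
```

In this toy setting the prefetch fires 100 ms before finalization, so the response is ready 100 ms earlier than the baseline at the cost of a single fetch. A trigger that fires on a partial result that later changes would add a wasted fetch, which is the latency versus computation trade-off the abstract describes.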

FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization

A Better and Faster end-to-end Model for Streaming ASR

Towards Fast and Accurate Streaming End-To-End ASR

Reducing Streaming ASR Model Delay with Self Alignment

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

Low Latency Speech Recognition Using End-to-End Prefetching

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

A CIF-Based Speech Segmentation Method for Streaming E2E ASR

Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition

A Language Agnostic Multilingual Streaming On-Device ASR System

Streaming Audio-Visual Speech Recognition with Alignment Regularization

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Compute Cost Amortized Transformer for Streaming ASR

Building Accurate Low Latency ASR for Streaming Voice Search

Extremely Low Footprint End-to-End ASR System for Smart Device

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

Efficient Streaming LLM for Speech Recognition

A Truly Multilingual First Pass and Monolingual Second Pass Streaming on-Device ASR System

Accelerating Transducers through Adjacent Token Merging

Low Latency ASR for Simultaneous Speech Translation