Abstract: Standard Recurrent Neural Network Transducers (RNN-T) decoding algorithms for speech recognition are iterating over the time axis, such that one time step is decoded before moving on to the next time step. Those algorithms result in a large number of calls to the joint network, which were shown in previous work to be an important factor that reduces decoding speed. We present a decoding beam search algorithm that batches the joint network calls across a segment of time steps, which results in 20%-96% decoding speedups consistently across all models and settings experimented with. In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output, causing our algorithm to improve oracle word error rate by up to 11% relative as the segment size increases, and to slightly improve general word error rate.
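The batching idea in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's algorithm: the `CountingJoint` class, the weights `W`, and both decode functions are hypothetical stand-ins, and the prediction-network state is held fixed across the segment, which a real RNN-T beam search would not do once a token is emitted. The sketch only shows that, for a fixed hypothesis, the joint outputs for several time steps can be obtained from one call instead of one call per frame.

```python
import math
import random

random.seed(0)
HIDDEN, VOCAB = 4, 3  # toy sizes, chosen only for illustration

# Hypothetical projection weights for the toy joint network.
W = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(HIDDEN)]


class CountingJoint:
    """Toy stand-in for a joint network that counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, frames, pred):
        """Return one row of log-probabilities per encoder frame.

        On an accelerator this loop would be a single batched operation;
        here it is plain Python so the example stays dependency-free.
        """
        self.calls += 1
        return [self._log_softmax(self._logits(f, pred)) for f in frames]

    @staticmethod
    def _logits(frame, pred):
        hidden = [math.tanh(f + p) for f, p in zip(frame, pred)]
        return [sum(h * W[i][v] for i, h in enumerate(hidden))
                for v in range(VOCAB)]

    @staticmethod
    def _log_softmax(logits):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - log_z for x in logits]


def decode_frame_by_frame(joint, enc_frames, pred):
    """One joint call per time step."""
    rows = []
    for frame in enc_frames:
        rows.extend(joint([frame], pred))
    return rows


def decode_segment_batched(joint, enc_frames, pred, segment_size):
    """One joint call per segment of `segment_size` time steps."""
    rows = []
    for start in range(0, len(enc_frames), segment_size):
        rows.extend(joint(enc_frames[start:start + segment_size], pred))
    return rows


enc_frames = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(12)]
pred_state = [random.gauss(0, 1) for _ in range(HIDDEN)]

frame_joint, segment_joint = CountingJoint(), CountingJoint()
per_frame = decode_frame_by_frame(frame_joint, enc_frames, pred_state)
batched = decode_segment_batched(segment_joint, enc_frames, pred_state,
                                 segment_size=4)

print(frame_joint.calls, segment_joint.calls)
```

With 12 frames and a segment size of 4, the toy joint is invoked 12 times in the frame-by-frame loop and 3 times in the batched loop, while both produce the same log-probabilities.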

What problem does this paper attempt to address?

This paper aims to improve the decoding speed and accuracy of the RNN-T (Recurrent Neural Network Transducer) model in speech recognition. The traditional RNN-T decoding algorithm iterates frame by frame, so the joint network is invoked very often, which increases computation and decoding time. To address this, the authors propose a new beam search decoding algorithm: the segment-based token-wise beam search algorithm.

### Problems with the traditional method

1. **Frame-by-frame iteration**: The standard RNN-T decoding algorithm processes the audio input one time step at a time.
2. **Frequent invocation of the joint network**: Because of frame-by-frame processing, the joint network must be invoked at each time step, which results in a large computational overhead.
3. **Slow decoding speed**: Frequent invocation of the joint network slows decoding, especially for longer audio sequences.
4. **Limited accuracy**: Frame-by-frame processing may cause some correct hypotheses to be discarded during the search, which hurts decoding accuracy.

### The proposed new method

The token-wise beam search algorithm proposed by the authors addresses these problems in the following ways:

1. **Batch processing of multiple time steps**: Instead of processing frame by frame, the new algorithm processes an audio segment containing multiple time steps at once. The joint network invocations for those time steps are combined into a single invocation, reducing the total number of invocations.
2. **Probability aggregation**: When processing a segment, the algorithm aggregates the probabilities of all paths leading to the same token sequence. This aggregation better approximates the most likely output sequence, thereby improving decoding accuracy.
3. **Reduction in the number of invocations**: Batching multiple time steps reduces the number of joint network invocations, which improves computational efficiency and decoding speed.
4. **Strong adaptability**: The algorithm is suitable for offline decoding and for decoding scenarios that are not strictly streaming, and it requires no modification to the trained RNN-T model.

### Experimental results

The experimental results show that the new token-wise beam search algorithm improves decoding speed and accuracy across different datasets:

- **Decoding speed**: With a segment size of 3 to 5 time steps, decoding speed increases by 20%-96%.
- **Accuracy**: As the segment size increases, decoding accuracy also improves, especially in terms of oracle WER (the word error rate of the best hypothesis), with a maximum relative improvement of 11%.

In conclusion, this paper proposes a new decoding algorithm that improves the decoding speed and accuracy of the RNN-T model in speech recognition tasks by batching multiple time steps and aggregating probabilities.
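The probability aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the blank symbol, the `aggregate` helper, and the path probabilities are all made up for the example. Each alignment path over a 3-frame segment contains three blanks plus the emitted tokens, and paths that yield the same token sequence after removing blanks have their probabilities summed in log space.

```python
import math
from collections import defaultdict

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def aggregate(path_log_probs):
    """Sum, in log space, over alignment paths that yield the same tokens."""
    grouped = defaultdict(list)
    for path, log_p in path_log_probs.items():
        tokens = tuple(symbol for symbol in path if symbol != BLANK)
        grouped[tokens].append(log_p)
    return {tokens: logsumexp(lps) for tokens, lps in grouped.items()}


# Made-up alignment probabilities over a 3-frame segment.
paths = {
    ("a", "_", "_", "_"): math.log(0.30),
    ("b", "_", "_", "_"): math.log(0.20),
    ("_", "b", "_", "_"): math.log(0.15),
    ("_", "_", "b", "_"): math.log(0.10),
}

# Ranking by the single best alignment path.
best_path = max(paths, key=paths.get)
best_path_tokens = tuple(s for s in best_path if s != BLANK)

# Ranking by the aggregated probability of each token sequence.
aggregated = aggregate(paths)
best_aggregated_tokens = max(aggregated, key=aggregated.get)

print(best_path_tokens, best_aggregated_tokens)
```

In this toy data the single most likely path emits `a` (probability 0.30), but the three paths emitting `b` together carry probability 0.45, so ranking by aggregated probability prefers `b`. That is the sense in which aggregation over a segment can approximate the most likely output better than scoring individual paths.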

A Token-Wise Beam Search Algorithm for RNN-T
