Abstract: Standard Recurrent Neural Network Transducer (RNN-T) decoding algorithms for speech recognition iterate over the time axis, such that one time step is decoded before moving on to the next time step. Those algorithms result in a large number of calls to the joint network, which were shown in previous work to be an important factor that reduces decoding speed. We present a decoding beam search algorithm that batches the joint network calls across a segment of time steps, which results in 20%-96% decoding speedups consistently across all models and settings experimented with. In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output, causing our algorithm to improve oracle word error rate by up to 11% relative as the segment size increases, and to slightly improve general word error rate.
What problem does this paper attempt to address?
This paper aims to improve the decoding speed and accuracy of the RNN-T (Recurrent Neural Network Transducer) model in speech recognition. Standard RNN-T decoding algorithms iterate frame by frame, so the joint network is invoked very often, which increases computation and decoding time. To address this, the authors propose a new beam search decoding algorithm: a segment-based, token-wise beam search.
### Problems with the traditional method
1. **Frame-by-frame iteration**: The standard RNN-T decoding algorithm walks through the audio input one time step at a time, fully decoding each step before moving on to the next.
2. **Frequent invocation of the joint network**: Because of this frame-by-frame processing, the joint network must be invoked at every time step, which adds substantial computational overhead.
3. **Slow decoding speed**: The large number of joint network calls slows decoding, especially for longer audio sequences.
4. **Limited accuracy**: Scoring hypotheses one frame at a time may cause some correct hypotheses to be pruned during the search, which can hurt decoding accuracy.
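To make the cost of frame-by-frame decoding concrete, the sketch below counts joint network calls for a single hypothesis under a deliberately simplified model. The frame-wise count uses the fact that each call in greedy RNN-T decoding emits either a blank (advancing one frame) or a token. The segmented count is an assumption made for illustration, not the paper's analysis: it supposes one batched call covers every frame of a segment for a given prediction-network state. The function names are hypothetical.

```python
import math


def framewise_joint_calls(num_frames: int, num_tokens: int) -> int:
    """Joint calls for greedy frame-by-frame decoding of one hypothesis.

    Every call emits either a blank (moves to the next frame) or a token,
    so decoding T frames into U tokens takes T + U calls.
    """
    return num_frames + num_tokens


def segmented_joint_calls(num_frames: int, num_tokens: int, segment_size: int) -> int:
    """Batched joint calls under a simplified, assumed counting model.

    Assumption: one batched call scores a prediction-network state against
    all frames of a segment, so each segment costs one call plus one extra
    call per token emitted inside it.
    """
    num_segments = math.ceil(num_frames / segment_size)
    return num_segments + num_tokens


# Illustrative sizes only; these are not figures from the paper.
baseline = framewise_joint_calls(num_frames=100, num_tokens=20)
batched = segmented_joint_calls(num_frames=100, num_tokens=20, segment_size=4)
print(baseline, batched)
```

Under this model a segment size of 1 reproduces the frame-wise count, and larger segments mainly shrink the per-frame term. Real beam search also multiplies these counts by the number of hypotheses and expansions, which this sketch ignores.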
### The proposed new method
The token-wise beam search algorithm proposed by the authors addresses these problems as follows:
1. **Batch processing of multiple time steps**: Instead of decoding frame by frame, the algorithm processes a segment of several time steps at once, so the joint network calls for those time steps are combined into a single batched call.
2. **Probability aggregation**: Within a segment, the algorithm aggregates the emission probabilities of all paths that lead to the same token sequence. This aggregation may be a better approximation to finding the most likely output sequence, which improves decoding accuracy.
3. **Fewer joint network calls**: Because calls are batched across a segment, the total number of joint network invocations drops, which improves computational efficiency and decoding speed.
4. **Broad applicability**: The algorithm suits offline decoding and decoding that is not strictly streaming, and it requires no modification to the trained RNN-T model.
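The probability aggregation step can be illustrated with a small sketch. It assumes a toy 3-frame segment in which every alignment path contains exactly three blanks (one per frame), and it sums the probabilities of all paths that collapse to the same token sequence using a log-sum-exp. The blank symbol, the path probabilities, and the function names are made up for illustration and do not come from the paper.

```python
import math
from collections import defaultdict

BLANK = "<b>"  # hypothetical blank symbol


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of log-probs."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def aggregate_paths(path_log_probs):
    """Sum the probability mass of all paths yielding the same token sequence."""
    grouped = defaultdict(list)
    for path, log_prob in path_log_probs.items():
        tokens = tuple(t for t in path if t != BLANK)  # drop blanks
        grouped[tokens].append(log_prob)
    return {tokens: logsumexp(lps) for tokens, lps in grouped.items()}


# Toy alignments over a 3-frame segment (illustrative probabilities).
paths = {
    ("a", BLANK, BLANK, BLANK): math.log(0.20),  # "a" emitted at frame 1
    (BLANK, "a", BLANK, BLANK): math.log(0.15),  # "a" emitted at frame 2
    (BLANK, BLANK, "a", BLANK): math.log(0.10),  # "a" emitted at frame 3
    ("b", BLANK, BLANK, BLANK): math.log(0.30),  # "b" emitted at frame 1
}

scores = aggregate_paths(paths)
best_single_path = max(paths, key=paths.get)
best_aggregated = max(scores, key=scores.get)
```

In this toy example the single most probable path emits "b", but once the three alignments of "a" are summed, "a" carries more total probability. That is the sense in which aggregating over a segment can better approximate the most likely output than ranking individual paths.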
### Experimental results
The experiments show that the token-wise beam search algorithm speeds up decoding across all models and settings tested, while also improving accuracy:
- **Decoding speed**: With a segment size of 3 to 5 time steps, decoding is 20%-96% faster.
- **Accuracy**: As the segment size increases, oracle WER (the word error rate of the best hypothesis among those returned by the search) improves by up to 11% relative, and general WER improves slightly.
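Since the accuracy figure above is a relative improvement, a short sketch of the arithmetic may help. The WER values below are hypothetical and chosen only to show how an 11% relative improvement is computed; they are not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative reduction of an error metric such as WER (lower is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Hypothetical values: an oracle WER dropping from 10.0% to 8.9%.
gain = relative_improvement(baseline=10.0, new=8.9)
print(f"{gain:.0%} relative improvement")
```

An 11% relative improvement is therefore much smaller than 11 absolute percentage points: in this made-up case the absolute change is 1.1 points.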
In conclusion, the paper proposes a decoding algorithm that improves both the decoding speed and the accuracy of RNN-T models in speech recognition by batching joint network calls across multiple time steps and aggregating probabilities over each segment.