Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami
Abstract: A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only enhances the HAT training efficiency but also encourages IAM and HAT to emit blanks synchronously which skips the more expensive non-blank computation, resulting in more effective blank thresholding for faster decoding. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines both HAT- and IAM-blank thresholding and a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
What problem does this paper attempt to address?
This paper attempts to solve two key problems in the field of Automatic Speech Recognition (ASR):
1. **Improve the performance of the HAT model**: The authors train an internal acoustic model (IAM) jointly with the hybrid autoregressive transducer (HAT). The IAM fully shares the encoder and joint network with HAT, and the joint training encourages the two models to emit blank symbols synchronously. This improves training efficiency and makes blank thresholding more effective at decoding time.
2. **Accelerate the decoding process**: Dual blank thresholding combines the blank thresholding of HAT and of the IAM so that the more expensive non-blank probability computation is skipped for frames judged to be blank, which speeds up decoding. The authors also pair it with a compatible decoding algorithm to mitigate the performance degradation caused by incorrectly skipped frames.
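The blank-skipping idea behind both points can be sketched in Python. HAT, as originally formulated, models the blank probability with a sigmoid on a dedicated blank logit and distributes the remaining probability mass over labels with a softmax; the sketch below assumes that factorization. The function names, the lazy `label_logits_fn` callback, and the threshold value of 0.9 are hypothetical illustrations, not details taken from the paper.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hat_posterior(
    blank_logit: float,
    label_logits_fn: Callable[[], Sequence[float]],
    blank_threshold: float = 0.9,  # hypothetical value, not from the paper
) -> Tuple[float, Optional[List[float]]]:
    """Return (P(blank), label posteriors or None if the frame is skipped)."""
    # Blank is modeled on its own, so it can be evaluated first and cheaply.
    p_blank = sigmoid(blank_logit)
    if p_blank >= blank_threshold:
        # Blank thresholding: the non-blank softmax is never computed.
        return p_blank, None
    # Only now pay for the label logits (assumed to be the expensive part).
    label_probs = softmax(label_logits_fn())
    # Labels share the probability mass left over after blank.
    return p_blank, [(1.0 - p_blank) * p for p in label_probs]
```

Passing the label logits as a callback is only a design choice for the sketch: it makes it explicit that a frame whose blank probability clears the threshold never triggers the non-blank computation.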
### Main contributions
- **Introduction of IAM**: The IAM consists of an encoder and a joint network whose parameters are fully shared with HAT. Because the two are trained jointly, IAM and HAT learn to emit blank symbols synchronously, which makes blank thresholding more effective.
- **Dual blank thresholding**: Decoding is accelerated further by applying blank thresholding in two steps. First, CTC-blank thresholding (on the IAM side) discards unnecessary encoder outputs. The remaining encoder outputs are then passed to the HAT decoder, where the more reliable HAT-blank thresholding is applied.
- **Decoding algorithm optimization**: Two popular decoding algorithms, Alignment Length Synchronized Decoding (ALSD) and Time Synchronized Decoding (TSD), are studied to alleviate the performance degradation caused by incorrect blank thresholding.
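The two-step filtering described above can be sketched as follows. This is a simplified, single-hypothesis illustration rather than the authors' algorithm: in a beam search such as ALSD or TSD, the HAT blank probability depends on each hypothesis's label history, which the sketch reduces to a per-frame callback. All names and both threshold values are hypothetical.

```python
from typing import Callable, List, Sequence


def dual_blank_threshold(
    iam_blank_probs: Sequence[float],
    hat_blank_prob_fn: Callable[[int], float],
    expand_fn: Callable[[int], None],
    iam_threshold: float = 0.95,  # hypothetical value
    hat_threshold: float = 0.9,   # hypothetical value
) -> List[int]:
    """Return the frame indices that reach the non-blank computation."""
    # Step 1: drop frames the IAM (CTC-style) output already marks as blank,
    # so they are never passed on to the HAT decoder.
    kept = [t for t, p in enumerate(iam_blank_probs) if p < iam_threshold]

    expanded: List[int] = []
    for t in kept:
        # Step 2: on the surviving frames, apply HAT-blank thresholding.
        if hat_blank_prob_fn(t) >= hat_threshold:
            continue  # treated as blank; the label distribution is skipped
        # Only here is the non-blank (label) computation carried out.
        expand_fn(t)
        expanded.append(t)
    return expanded
```

In this sketch the first step costs one comparison per frame, while the second step runs only on the frames that survive it, which is where the saving over HAT-blank thresholding alone would come from.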
### Experimental results
Experiments show that all of the CTC objective functions evaluated improve HAT, with statistically significant relative error reductions over the vanilla HAT in both offline and streaming modes. Combining the IAM with dual blank thresholding yields a 42-75% decoding speed-up with no major degradation in ASR performance.
### Conclusion
Joint training with the IAM improves the recognition performance of the HAT model, and dual blank thresholding substantially speeds up decoding. Together, these improvements make HAT better suited to practical applications that need fast responses and efficient processing.