Abstract: A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only enhances the HAT training efficiency but also encourages IAM and HAT to emit blanks synchronously, which skips the more expensive non-blank computation, resulting in more effective blank thresholding for faster decoding. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines both HAT- and IAM-blank thresholding and a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
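To make the phrase "models blank and non-blank posterior distributions separately" concrete, here is a minimal sketch of the usual HAT factorization: the blank probability comes from a sigmoid on its own logit, and the label probabilities are a softmax over the non-blank vocabulary scaled by one minus the blank probability. The function name `hat_posterior` and the example logits are illustrative assumptions, not taken from the paper.

```python
import math

def hat_posterior(blank_logit, label_logits):
    """Hypothetical helper: HAT-style posterior for one (frame, prefix) pair.

    Returns (p_blank, p_labels), where p_blank + sum(p_labels) == 1.
    """
    # Blank is modeled on its own as a Bernoulli variable via a sigmoid.
    p_blank = 1.0 / (1.0 + math.exp(-blank_logit))
    # Non-blank labels use a numerically stable softmax over the label logits,
    # scaled by the remaining probability mass (1 - p_blank).
    m = max(label_logits)
    exps = [math.exp(x - m) for x in label_logits]
    z = sum(exps)
    p_labels = [(1.0 - p_blank) * e / z for e in exps]
    return p_blank, p_labels

# Example logits are made up for illustration.
p_blank, p_labels = hat_posterior(2.0, [0.5, 1.5, -1.0])
total = p_blank + sum(p_labels)  # the two parts form one valid distribution
```

Because the blank probability depends only on its own logit, a decoder can inspect it first and skip the softmax over labels when blank is already very likely, which is the property blank thresholding relies on.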

What problem does this paper attempt to address?

This paper addresses two key problems in automatic speech recognition (ASR):

1. **Improve the performance of the HAT model**: An internal acoustic model (IAM) is trained jointly with the hybrid autoregressive transducer (HAT) to improve its recognition accuracy. Specifically, the authors propose a joint training strategy in which IAM and HAT share the encoder and the joint network and emit blank symbols synchronously, improving both training efficiency and decoding speed.
2. **Accelerate the decoding process**: Dual blank thresholding combines the blank thresholding methods of HAT and IAM to avoid unnecessary non-blank probability computations, which significantly speeds up decoding. The authors also explore compatible decoding algorithms to mitigate the performance degradation caused by incorrect frame skipping.

### Main contributions

- **Introduction of IAM**: IAM consists of an encoder and a joint network, fully shares its parameters with HAT, and is trained jointly with it. This lets IAM and HAT emit blank symbols synchronously, making blank thresholding more effective.
- **Dual blank thresholding**: The blank thresholding methods of HAT and IAM are combined into a two-step procedure that further improves decoding speed. First, CTC-blank thresholding eliminates unnecessary encoder outputs; the remaining encoder outputs are then passed to the HAT decoder, where the more reliable HAT-blank thresholding is applied.
- **Decoding algorithm optimization**: Two popular decoding algorithms, Alignment Length Synchronized Decoding (ALSD) and Time Synchronized Decoding (TSD), are studied to alleviate the performance degradation caused by incorrect blank thresholding.
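The two-step dual blank thresholding described in the contributions can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function name, the threshold values, and the per-frame probabilities are all assumptions, and the HAT blank probability is treated as a function of the frame index only, whereas in a real decoder it also depends on the current label prefix.

```python
def dual_blank_thresholding(iam_blank_probs, hat_blank_fn,
                            iam_threshold=0.95, hat_threshold=0.95):
    """Hypothetical sketch of two-step blank thresholding over encoder frames.

    iam_blank_probs: per-frame blank probabilities from the IAM (CTC-style).
    hat_blank_fn:    callable t -> HAT blank probability for frame t
                     (simplified: ignores the label prefix).
    Returns (kept, skipped_by_iam, skipped_by_hat) as lists of frame indices.
    """
    kept, skipped_by_iam, skipped_by_hat = [], [], []
    for t, p_iam in enumerate(iam_blank_probs):
        # Step 1: drop frames the IAM already considers blank, so they
        # never reach the HAT decoder at all.
        if p_iam > iam_threshold:
            skipped_by_iam.append(t)
            continue
        # Step 2: for surviving frames, check the HAT blank probability
        # before paying for the non-blank (label softmax) computation.
        if hat_blank_fn(t) > hat_threshold:
            skipped_by_hat.append(t)
            continue
        kept.append(t)
    return kept, skipped_by_iam, skipped_by_hat

# Made-up blank probabilities for five frames.
iam = [0.99, 0.20, 0.97, 0.60, 0.10]
hat = [0.98, 0.10, 0.50, 0.99, 0.30]
kept, s_iam, s_hat = dual_blank_thresholding(iam, lambda t: hat[t])
print(kept, s_iam, s_hat)  # [1, 4] [0, 2] [3]
```

In this toy run only two of the five frames need the full non-blank computation. Lowering either threshold skips more frames but raises the risk of the incorrect frame skipping that the decoding algorithms are meant to compensate for.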
### Experimental results

Experiments show that all CTC objective functions enhance the performance of HAT, with a statistically significant relative error reduction compared to the original HAT in both offline and streaming modes. In particular, combining IAM with dual blank thresholding achieves a 42-75% decoding speed-up without significantly degrading ASR performance.

### Conclusion

Through joint training with IAM and the dual blank thresholding technique, this paper improves the performance of the HAT model and significantly speeds up decoding. These improvements support fast response and efficient processing in practical applications.

Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

Hybrid Autoregressive Transducer (HAT)

Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition

Alignment-Free Training for Transducer-based Multi-Talker ASR

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

TST: Time-Sparse Transducer for Automatic Speech Recognition

Combining Hybrid DNN-HMM ASR Systems with Attention-Based Models Using Lattice Rescoring

Multi-blank Transducers for Speech Recognition

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

A Deliberation-based Joint Acoustic and Text Decoder

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Improving Scheduled Sampling for Neural Transducer-based ASR

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Improving RNN transducer with normalized jointer network

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models