Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký
2024-10-23
Abstract: This paper proposes a simple yet effective way of regularising encoder-decoder-based automatic speech recognition (ASR) models that enhances their robustness and improves generalisation to out-of-domain scenarios. The proposed approach is dubbed $\textbf{De}$coder-$\textbf{C}$entric $\textbf{R}$egularisation in $\textbf{E}$ncoder-$\textbf{D}$ecoder (DeCRED) architecture for ASR, where auxiliary classifiers are introduced in layers of the decoder module. Leveraging these classifiers, we propose two decoding strategies that re-estimate the next-token probabilities. Using the recent E-branchformer architecture, we build strong ASR systems that obtain WERs competitive with Whisper-medium and outperform OWSM v3, while relying on only a fraction of the training data and model size. On top of such a strong baseline, we show that DeCRED can further improve the results and, moreover, generalise much better to out-of-domain scenarios, where we show absolute WER reductions of 2.7 and 2.9 on the AMI and Gigaspeech datasets, respectively. We provide extensive analysis and accompanying experiments that support the benefits of the proposed regularisation scheme.
Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper addresses the limited ability of automatic speech recognition (ASR) models to generalize to new or unseen domains. Specifically, the authors propose a regularization method, **Decoder-Centric Regularisation in Encoder-Decoder (DeCRED)**, to improve the robustness and generalization of ASR models built on the encoder-decoder architecture.

#### Main problems and challenges:

1. **Generalizing to new domains**: Existing ASR models perform poorly on domains outside their training data, especially when the audio is diverse.
2. **Overfitting**: Large-scale ASR models are prone to overfitting to specific domains, which degrades their performance on unseen data.
3. **Resource efficiency**: Performance and generalization have to be improved with limited computing resources.

#### Solutions:

- **Introducing auxiliary classifiers**: Auxiliary classifiers are attached to intermediate layers of the decoder module to regularize the internal language model (ILM), which counters overfitting and improves generalization.
- **Proposing two decoding strategies**: The auxiliary classifiers are used to re-estimate the probability of the next token during decoding.
- **Experimental verification**: DeCRED is evaluated on multiple multi-domain datasets, with the largest gains reported on unseen datasets.

#### Specific improvements:

- **Model architecture**: A strong ASR system is built on the E-branchformer architecture, combining several known techniques for improving model robustness.
- **Regularization method**: The generalization ability of the model is tuned further by adjusting the weights of the auxiliary classifiers.
- **Experimental results**: DeCRED achieves absolute word error rate (WER) reductions of 2.7 and 2.9 percentage points on the out-of-domain AMI and Gigaspeech datasets, respectively.

In conclusion, by introducing decoder-centric regularization, the paper improves the robustness of ASR models and their generalization to new domains, mitigating the performance degradation that existing models show on unseen data.
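The two ideas summarized above, an auxiliary classifier on an intermediate decoder layer and a re-estimation of next-token probabilities, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single auxiliary classifier, the convex weighting via `aux_weight`, and the log-linear interpolation via `mix` are all assumptions made for illustration, since the summary does not specify the exact loss or either decoding strategy.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, targets):
    """Mean negative log-likelihood; logits has shape (T, V), targets (T,)."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(targets)), targets].mean())


def decred_style_loss(final_logits, aux_logits, targets, aux_weight=0.5):
    """Hypothetical training objective: a weighted sum of the loss from the
    final decoder layer and the loss from one auxiliary classifier attached
    to an intermediate decoder layer (the weighting scheme is assumed)."""
    final_loss = cross_entropy(final_logits, targets)
    aux_loss = cross_entropy(aux_logits, targets)
    return (1.0 - aux_weight) * final_loss + aux_weight * aux_loss


def combine_next_token_logprobs(final_logits, aux_logits, mix=0.3):
    """Hypothetical decoding-time re-estimation: log-linear interpolation of
    the two classifiers' next-token distributions, then renormalisation."""
    combined = (1.0 - mix) * log_softmax(final_logits) + mix * log_softmax(aux_logits)
    return log_softmax(combined)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_tokens, vocab_size = 4, 6  # toy sizes, not taken from the paper
    final_logits = rng.normal(size=(num_tokens, vocab_size))
    aux_logits = rng.normal(size=(num_tokens, vocab_size))
    targets = rng.integers(0, vocab_size, size=num_tokens)

    print("loss:", decred_style_loss(final_logits, aux_logits, targets))
    print("next token per step:",
          combine_next_token_logprobs(final_logits, aux_logits).argmax(axis=-1))
```

Setting `aux_weight=0` or `mix=0` recovers an unregularized model that uses only the final classifier, which makes the auxiliary branch easy to ablate in this sketch.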