Abstract: This paper proposes a simple yet effective way of regularising encoder-decoder-based automatic speech recognition (ASR) models that enhances the robustness of the model and improves generalisation to out-of-domain scenarios. The proposed approach is dubbed $\textbf{De}$coder-$\textbf{C}$entric $\textbf{R}$egularisation in $\textbf{E}$ncoder-$\textbf{D}$ecoder (DeCRED) architecture for ASR, where one or more auxiliary classifiers are introduced in layers of the decoder module. Leveraging these classifiers, we propose two decoding strategies that re-estimate the next-token probabilities. Using the recent E-branchformer architecture, we build strong ASR systems that obtain WERs competitive with Whisper-medium and outperform OWSM v3, while relying on only a fraction of the training data and model size. On top of such a strong baseline, we show that DeCRED can further improve the results and, moreover, generalise much better to out-of-domain scenarios, where we show absolute WER reductions of 2.7 and 2.9 on the AMI and Gigaspeech datasets, respectively. We provide extensive analysis and accompanying experiments that support the benefits of the proposed regularisation scheme.
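
The abstract does not state the training objective, so the following is only a plausible sketch of how an auxiliary decoder classifier can act as a regulariser. It assumes a single auxiliary classifier attached to an intermediate decoder layer $m$ of an $N$-layer decoder and a hypothetical interpolation weight $\lambda \in [0, 1]$; neither symbol comes from the abstract, and the actual DeCRED objective may contain further terms.

$$
\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\text{CE}}^{(N)} + \lambda\,\mathcal{L}_{\text{CE}}^{(m)}, \qquad \mathcal{L}_{\text{CE}}^{(l)} = -\sum_{t=1}^{T} \log p^{(l)}\left(y_t \mid y_{<t}, X\right)
$$

Here $p^{(l)}$ is the next-token distribution predicted by the classifier on decoder layer $l$, $X$ is the input audio and $y_1, \dots, y_T$ is the reference transcript. Setting $\lambda = 0$ recovers ordinary cross-entropy training of the final layer alone.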

### What problem does this paper attempt to solve?

This paper aims to solve the problem of the generalization ability of automatic speech recognition (ASR) models in new or unseen domains. Specifically, the authors propose a new regularization method, **Decoder-Centric Regularisation (DeCRED)**, to enhance the robustness and generalization ability of ASR models based on the encoder-decoder architecture.

#### Main problems and challenges:

1. **Challenges of generalizing to new domains**: Existing ASR models perform poorly in new domains outside the training data, especially when facing diverse audio data.
2. **Over-fitting problem**: Large-scale ASR models are prone to over-fitting within specific domains, resulting in a decline in their performance on unseen data.
3. **Resource utilization efficiency**: How to improve the performance and generalization ability of the model with limited computing resources.

#### Solutions:

- **Introducing an auxiliary classifier**: Introduce an auxiliary classifier in the intermediate layer of the decoder module to regularize the internal language model (ILM), prevent over-fitting and improve generalization ability.
- **Proposing two decoding strategies**: Use these auxiliary classifiers to re-estimate the probability of the next token, thereby improving the decoding process.
- **Experimental verification**: Verify the effectiveness of DeCRED through experiments on multiple multi-domain datasets and demonstrate its significant performance improvement on unseen datasets.

#### Specific improvements:

- **Model architecture**: Build a powerful ASR system based on the E-branchformer architecture and combine multiple known techniques for enhancing model robustness.
- **Regularization method**: Further optimize the generalization ability of the model by adjusting the weights of the auxiliary classifier.
- **Experimental results**: Achieve an absolute word error rate (WER) reduction of 2.7% and 2.9% on the AMI and Gigaspeech datasets respectively, demonstrating the superior performance of DeCRED in unseen domains.

In conclusion, this paper effectively improves the generalization ability and robustness of ASR models in new domains by introducing the decoder-centric regularization method, and solves the problem of performance degradation of existing models on unseen data.
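
The summary says the auxiliary classifiers are used to re-estimate the probability of the next token but does not spell out either decoding strategy. As an illustration only, the sketch below shows one generic way such a re-estimation could work: a log-linear interpolation of the final-layer and auxiliary-layer distributions, followed by renormalisation. The function names, the `aux_weight` parameter and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reestimate_next_token(final_logits, aux_logits, aux_weight=0.3):
    """Combine final-layer and auxiliary-layer predictions for one step.

    Assumption: a log-linear interpolation with a hypothetical weight
    `aux_weight`; the paper's two strategies may differ from this.
    Returns renormalised log-probabilities over the vocabulary.
    """
    final_lp = log_softmax(final_logits)
    aux_lp = log_softmax(aux_logits)
    mixed = [
        (1.0 - aux_weight) * f + aux_weight * a
        for f, a in zip(final_lp, aux_lp)
    ]
    # Renormalise so the combined scores form a proper distribution.
    return log_softmax(mixed)


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Toy three-token vocabulary: the final layer slightly prefers token 0,
# while the auxiliary classifier clearly prefers token 1.
final_logits = [2.0, 1.9, 0.1]
aux_logits = [0.5, 2.5, 0.1]

baseline_choice = argmax(log_softmax(final_logits))
rescored = reestimate_next_token(final_logits, aux_logits, aux_weight=0.4)
rescored_choice = argmax(rescored)
```

In this toy case the final layer alone picks token 0, whereas the interpolated scores pick token 1, which shows how an intermediate classifier can overturn a narrow final-layer decision. In a real decoder the same combination would be applied at every beam-search step, and the weight would be tuned on held-out data.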

Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models
