Abstract: End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.

What problem does this paper attempt to address?

This paper addresses a problem in end-to-end (E2E) multilingual speech recognition: how to improve recognition accuracy by supplying a language prompt when the input language is already known. Specifically, it focuses on how to incorporate language prompt information into models based on Connectionist Temporal Classification (CTC).

### Main problems of the paper

1. **Language-specific adaptability in multilingual speech recognition**:
   - In multilingual speech recognition tasks, the language is usually known, so a language prompt can focus the model on recognizing that specific language.
   - Attention-based encoder-decoder architectures can improve performance significantly when given language IDs as prompts, but CTC cannot use language prompts directly because of its conditional independence assumption.
2. **Challenges of language prompts in CTC models**:
   - The output of a CTC model is conditionally independent at each time step, so it cannot adjust its output according to context the way an attention mechanism can.
   - As a result, a CTC model in a multilingual task gains little from a simple language prompt.

### Solutions

To solve these problems, the paper proposes an encoder prompting technique built on the self-conditioned CTC (SC-CTC) framework. It has three components:

1. **Introducing the SC-CTC framework**:
   - SC-CTC relaxes the conditional independence assumption of CTC by computing the CTC loss at intermediate encoder layers and feeding the intermediate predictions back into the next layer.
2. **Encoder prompting**:
   - At inference time, language prompt information is injected into the encoder by modifying the output probabilities of the intermediate layers. The paper proposes three modification methods:
     - **Replacement**: Only modify the frames with the highest language ID probability.
     - **Aggregation**: Aggregate the language ID probabilities in all frames to the target language ID.
     - **Prefix**: Only modify the first few frames to represent the prompt information.
3. **Soft prompting**:
   - When the input may be any one of several languages, a soft prompting method lets the model adapt to a set of candidate languages.

### Experimental results

The paper evaluates the proposed method on multiple large-scale multilingual datasets, including Common Voice, VoxForge and FLEURS. The results show an average relative error rate reduction of 28%, and a 41% reduction for languages with extremely scarce resources (less than 5 hours of training data).

### Conclusion

The paper proposes a multilingual speech recognition adaptation technique that achieves rapid adaptation to a known input language through encoder prompting. The method significantly improves recognition performance and can be applied flexibly in multilingual settings, with the largest gains on low-resource languages.
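The three encoder prompting modifications and the soft prompting idea summarized under Solutions can be illustrated with a small NumPy sketch. This is only one possible reading of the summary, not the paper's actual formulation: the function names, the choice of language-ID token indices, the proportional redistribution used for soft prompting, and the simplified feedback step (hidden states plus a linear projection of the posterior) are all assumptions made for illustration.

```python
import numpy as np

# post: (T, V) intermediate CTC posterior, one row per frame, rows sum to 1.
# lid_ids: vocabulary indices of language-ID tokens (hypothetical layout).
# target / candidates are assumed to be members of lid_ids.

def aggregate(post, lid_ids, target):
    """Aggregation: in every frame, move all language-ID mass onto `target`."""
    out = post.copy()
    lid_mass = out[:, lid_ids].sum(axis=1)
    out[:, lid_ids] = 0.0
    out[:, target] = lid_mass
    return out

def replace_top(post, lid_ids, target):
    """Replacement (assumed reading): do the same move, but only in frames
    whose most probable token is a language-ID token."""
    out = post.copy()
    is_lid_frame = np.isin(post.argmax(axis=1), lid_ids)
    out[is_lid_frame] = aggregate(post[is_lid_frame], lid_ids, target)
    return out

def prefix(post, target, n_frames=1):
    """Prefix: overwrite the first `n_frames` frames with a one-hot prompt."""
    out = post.copy()
    out[:n_frames] = 0.0
    out[:n_frames, target] = 1.0
    return out

def soft_prompt(post, lid_ids, candidates):
    """Soft prompting (assumed reading): restrict language-ID mass to the
    candidate languages, keeping their relative weights per frame."""
    out = post.copy()
    lid_mass = out[:, lid_ids].sum(axis=1, keepdims=True)
    cand = post[:, candidates]
    cand_sum = cand.sum(axis=1, keepdims=True)
    # Fall back to uniform weights in frames where candidates have no mass.
    weights = np.where(cand_sum > 0,
                       cand / np.maximum(cand_sum, 1e-12),
                       1.0 / len(candidates))
    out[:, lid_ids] = 0.0
    out[:, candidates] = lid_mass * weights
    return out

def feed_back(hidden, post, proj):
    """Simplified self-conditioning: add a projection of the (possibly
    modified) posterior to the hidden states passed to the next layer."""
    return hidden + post @ proj

# Toy example with random values standing in for a real encoder.
rng = np.random.default_rng(0)
T, V, D = 6, 8, 4
LID_IDS = [5, 6, 7]  # hypothetical slots for three language-ID tokens
logits = rng.normal(size=(T, V))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

prompted = aggregate(post, LID_IDS, target=6)
next_input = feed_back(rng.normal(size=(T, D)), prompted,
                       rng.normal(size=(V, D)))
print(next_input.shape)
```

In this sketch every modification keeps each frame a valid probability distribution, and because the prompt only touches the intermediate posterior at inference time, no retraining step appears anywhere, which matches the zero-shot framing in the abstract.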

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
