Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe
2024-06-18
Abstract: End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses a problem in end-to-end (E2E) multilingual speech recognition: how to improve recognition accuracy by providing language prompts when the input language is known. Specifically, it focuses on how to incorporate language prompt information into a model based on Connectionist Temporal Classification (CTC).

### Main problems of the paper

1. **Language-specific adaptability in multilingual speech recognition**:
   - In multilingual speech recognition tasks, the language is usually known, so the model can be focused on a specific language by providing a language prompt.
   - Attention-based encoder-decoder architectures can significantly improve performance when given a language ID as a prompt, but the CTC method cannot directly use language prompts because of its conditional independence assumption.
2. **Challenges of language prompts in CTC models**:
   - The output of the CTC model is conditionally independent at each time step, so it cannot adjust its output according to previously emitted tokens the way an attention-based decoder can.
   - As a result, a CTC model in a multilingual task gains little from a simple language prompt.

### Solutions

To solve the above problems, the paper proposes a new encoder prompting technique based on the self-conditioned CTC (SC-CTC) framework. The specific steps are as follows:

1. **Introducing the SC-CTC framework**:
   - SC-CTC relaxes the conditional independence assumption of CTC by computing the CTC loss at intermediate encoder layers and feeding the intermediate predictions back into the next layer.
2. **Encoder prompting**:
   - At inference time, language prompt information is injected into the encoder by modifying the output probabilities of the intermediate layer. The paper proposes three modification methods:
     - **Replacement**: Only modify the frames with the highest language ID probability.
     - **Aggregation**: Aggregate the language ID probabilities in all frames onto the target language ID.
     - **Prefix**: Only modify the first few frames to represent the prompt information.
3. **Soft prompting**:
   - When the input language may be one of several languages, a soft prompting method lets the model adapt to multiple candidate languages.

### Experimental results

The paper verifies the effectiveness of the proposed method through experiments on multiple large-scale multilingual datasets such as Common Voice, VoxForge and FLEURS. The results show that the method reduces the error rate by a relative 28% on average, and by 41% for languages with extremely scarce resources (less than 5 hours of training data).

### Conclusion

The paper proposes a new multilingual speech recognition adaptation technique that achieves rapid adaptation to a known input language through encoder prompting. The method significantly improves recognition performance and can be applied flexibly in a multilingual setting, especially for languages with scarce resources.
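To make the three modification methods described under Solutions more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the language ID tokens are part of the CTC vocabulary, that the intermediate layer output is a per-frame probability distribution over that vocabulary, and that "frames with the highest language ID probability" means frames whose most likely token is a language ID. The function names, the toy vocabulary, and the `n_frames` parameter are hypothetical. In an SC-CTC encoder, the modified posteriors would then be fed back into the next layer in place of the original intermediate predictions.

```python
from typing import List, Set

Frame = List[float]  # posterior over the vocabulary for a single frame


def one_hot(size: int, index: int) -> Frame:
    """Distribution that puts all probability mass on one token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def replacement(post: List[Frame], lang_ids: Set[int], target: int) -> List[Frame]:
    """Overwrite only the frames whose most likely token is a language ID
    (one possible reading of the Replacement method)."""
    out = []
    for frame in post:
        best = max(range(len(frame)), key=frame.__getitem__)
        out.append(one_hot(len(frame), target) if best in lang_ids else list(frame))
    return out


def aggregation(post: List[Frame], lang_ids: Set[int], target: int) -> List[Frame]:
    """In every frame, move all language-ID probability mass onto the target ID."""
    assert target in lang_ids, "target must be one of the language ID tokens"
    out = []
    for frame in post:
        mass = sum(frame[i] for i in lang_ids)
        new = [0.0 if i in lang_ids else p for i, p in enumerate(frame)]
        new[target] = mass  # each frame still sums to the same total
        out.append(new)
    return out


def prefix(post: List[Frame], target: int, n_frames: int = 1) -> List[Frame]:
    """Overwrite only the first n_frames frames with the target language ID."""
    return [
        one_hot(len(frame), target) if t < n_frames else list(frame)
        for t, frame in enumerate(post)
    ]


# Toy vocabulary (hypothetical): 0 = blank, 1 = "a", 2 = <en>, 3 = <ja>
LANG_IDS = {2, 3}
posteriors = [
    [0.10, 0.10, 0.60, 0.20],  # frame dominated by the wrong language ID
    [0.70, 0.20, 0.05, 0.05],  # frame dominated by blank
]

for name, result in [
    ("replacement", replacement(posteriors, LANG_IDS, target=3)),
    ("aggregation", aggregation(posteriors, LANG_IDS, target=3)),
    ("prefix", prefix(posteriors, target=3, n_frames=1)),
]:
    print(name, result)
```

In this toy input, Replacement and Prefix both overwrite the first frame entirely, while Aggregation keeps the non-language probabilities of every frame intact and only reassigns the language ID mass. That difference is the main design trade-off between the three variants as sketched here.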