Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs

Yang Yuhang, Peng Yizhou, Eng Siong Chng, Xionghu Zhong
2024-09-24
Abstract: The integration of large language models (LLMs) with pre-trained speech models has opened up new avenues in automatic speech recognition (ASR). While LLMs excel in multimodal understanding tasks, effectively leveraging their capabilities for ASR remains a significant challenge. This paper presents a novel training approach to enhance LLM performance in ASR tasks. We propose pre-training LLMs on Pinyin embedding sequences, which represent pronunciation features, to generate corresponding Chinese characters. This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data. Furthermore, we fine-tune the LoRA parameters to enhance the LLM's understanding of speech modality information. On the AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline without Pinyin-to-Character pre-training. Additionally, incorporating auxiliary text data for Pinyin-to-Character pre-training further boosts performance, achieving a 19.0% relative improvement.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to use large language models (LLMs) effectively to improve automatic speech recognition (ASR). Although LLMs perform well on multimodal understanding tasks, applying their capabilities to ASR remains a challenge. The paper proposes a two-stage training method:

1. **Stage One**: Pre-train the LLM to generate Chinese characters from Pinyin embedding sequences, so that the model learns to produce text from pronunciation features before it is exposed to real speech data.
2. **Stage Two**: Fine-tune low-rank adaptation (LoRA) parameters to improve the LLM's understanding of speech-modality information. A pre-trained audio model extracts audio features, which are fed to the LLM, and the LLM is trained to generate the corresponding text sequences.

On the AISHELL-1 corpus, this method yields a 9.5% relative improvement on the ASR task over the baseline without Pinyin-to-character pre-training. Adding auxiliary text data to the Pinyin-to-character pre-training raises the relative improvement to 19.0%.

In conclusion, the paper's main contribution is a new pre-training method that improves LLM performance on ASR by using Pinyin embedding sequences as a bridge between the speech and text modalities.
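The Stage One data flow can be illustrated with a minimal sketch: each Chinese character is mapped to a Pinyin syllable, each syllable is looked up in an embedding table, and the resulting vector sequence is paired with the original characters as the generation target. Everything named here is an assumption for illustration and is not taken from the paper: the four-entry `TOY_LEXICON`, the `EMBED_DIM` value, the random vectors standing in for learned embeddings, and the helper names.

```python
import random

# Hypothetical toy lexicon (character -> tone-numbered Pinyin syllable).
# A real system would need a full pronunciation dictionary and a way to
# handle characters with more than one reading; neither is shown here.
TOY_LEXICON = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

EMBED_DIM = 4  # illustrative size only, not a value from the paper


def text_to_pinyin(text):
    """Map each character to its Pinyin syllable using the toy lexicon."""
    return [TOY_LEXICON[ch] for ch in text]


def build_embedding_table(syllables, dim=EMBED_DIM, seed=0):
    """Give each distinct syllable a random vector (stand-in for a learned embedding)."""
    rng = random.Random(seed)
    return {
        s: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for s in sorted(set(syllables))
    }


def make_example(text, table):
    """Return (input embedding sequence, target character string) for one sentence."""
    pinyin = text_to_pinyin(text)
    return [table[s] for s in pinyin], text


table = build_embedding_table(TOY_LEXICON.values())
inputs, target = make_example("你好世界", table)

print(text_to_pinyin("你好世界"))  # Pinyin syllables, one per character
print(len(inputs), len(inputs[0]), target)  # sequence length, vector size, target text
```

In this sketch the input sequence has one vector per character, so input and target lengths match. In Stage Two the input would instead be features from a pre-trained audio model, with the same character sequence as the target.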