An efficient text augmentation approach for contextualized Mandarin speech recognition

Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan
2024-06-14
Abstract: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the weak performance of automatic speech recognition (ASR) systems on rare words. Existing contextualized ASR systems improve the recognition of uncommon words, but their effectiveness is limited by the availability of paired speech-text data. To address this, the paper proposes using large text-only datasets to contextualize a pre-trained ASR model through a simple text augmentation (TA) technique, while keeping the computational cost minimal.

### Main problems and solutions

1. **Problems**:
   - Existing contextualized ASR systems rely on paired speech-text data, and the limited amount of such data leads to poor performance on rare words.
   - Recognition performance, especially on rare words, has to improve without a large increase in computational cost.
2. **Solutions**:
   - **Utilizing text-only data**: Enhance the ASR model with large amounts of text-only data, which is easier to obtain and covers a wider vocabulary.
   - **Text augmentation (TA) technique**: Convert text-only data into latent text embeddings, which then enhance the inputs of the contextualized ASR model.
   - **Codebook construction**: Build a codebook from a small amount of paired speech-text data, then convert text-only data into latent text embeddings through a simple codebook lookup.
   - **Multimodal alignment**: Align speech and text implicitly through the continuous integrate-and-fire (CIF) mechanism of the pre-trained ASR model, avoiding the need for additional features.

### Experimental results

Experiments on diverse Mandarin test sets show that the proposed TA method significantly improves recognition performance, especially on rare words. The best system achieves relative character error rate (CER) improvements of up to 30% on rare words and 15% across all words.

### Summary

This paper proposes a simple and effective text augmentation method that exploits text-only data to improve a contextualized ASR system at minimal computational cost, with the largest gains on rare words.
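
The codebook construction and lookup steps described above can be illustrated with a minimal Python sketch. The abstract does not say how the codebook entries are computed, so everything here is an assumption made for illustration: the names `build_codebook` and `text_to_latent` are hypothetical, each codebook entry is taken to be the average of the per-character embeddings seen in the paired data (standing in for the token-aligned outputs of a CIF module), and characters missing from the codebook are simply skipped.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def build_codebook(pairs: Sequence[Tuple[str, Sequence[Vector]]]) -> Dict[str, Vector]:
    """Average the per-character embeddings observed in paired data.

    Assumption: each pair is (transcript, embeddings) with exactly one
    embedding per character, as a stand-in for CIF-aligned outputs.
    """
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = defaultdict(int)
    for transcript, embeddings in pairs:
        if len(transcript) != len(embeddings):
            raise ValueError("expected one embedding per character")
        for char, emb in zip(transcript, embeddings):
            if char not in sums:
                sums[char] = [0.0] * len(emb)
            # Accumulate element-wise so the mean can be taken afterwards.
            sums[char] = [s + e for s, e in zip(sums[char], emb)]
            counts[char] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


def text_to_latent(text: str, codebook: Dict[str, Vector]) -> List[Vector]:
    """Map text-only data to latent embeddings by codebook lookup.

    Assumption: characters without a codebook entry are skipped.
    """
    return [codebook[c] for c in text if c in codebook]


# Toy paired data: two transcripts with 2-dimensional embeddings.
paired = [
    ("你好", [[1.0, 0.0], [0.0, 1.0]]),
    ("你们", [[3.0, 0.0], [0.0, 5.0]]),
]
codebook = build_codebook(paired)

# Text-only input is converted without any audio.
latent = text_to_latent("好你们", codebook)
```

In this toy setup the lookup needs no audio and no forward pass through an acoustic encoder, which is consistent with the low computational cost the summary emphasizes. How the paper actually handles unseen characters or multiple embeddings per character is not stated in the abstract.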