Advances in Cantonese Speech Recognition: A Language-Specific Pretraining Model and RNN-T Loss
Junyun Guan, Minqiang Xu, Xuan, Lei Fang, Yihao Chen, Liang He
DOI: https://doi.org/10.1109/iaecst60924.2023.10503177
2023-01-01
Abstract: Cantonese is one of the commonly used Chinese dialects, yet effective research on Cantonese speech recognition remains scarce. Moreover, because Cantonese training data is limited, its recognition performance lags behind that of other languages. To address these issues, this paper proposes a Cantonese speech recognition framework based on unsupervised, language-specific pre-trained representations. First, an unlabeled Cantonese dataset was collected and used to train a Cantonese-specific pre-training model within the wav2vec 2.0 framework. Using only 2,000 hours of Cantonese data, this model outperformed XLSR-53, which was pre-trained on 56,000 hours, with a 6% relative improvement in recognition. Supervised training schemes were then developed that combine the RNN-T and CTC loss functions for joint training. Together, these approaches yield a Cantonese speech recognition system suited to low-resource environments. On the open-source Cantonese test set, the proposed approach achieved a character error rate (CER) of 15.57%, an absolute reduction of 14.61 percentage points from the 30.18% CER of the Conformer end-to-end approach, validating the effectiveness of the proposed method.
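
To make the reported figures concrete, the following is a minimal sketch of how a character error rate is conventionally computed (character-level edit distance divided by reference length) and how the two CER values in the abstract relate as absolute and relative reductions. The function names `edit_distance` and `cer` are illustrative assumptions, not code from the paper, and the paper's exact scoring script is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# CER values reported in the abstract, in percent.
baseline_cer = 30.18  # Conformer end-to-end approach
proposed_cer = 15.57  # proposed approach

# Difference in percentage points (the 14.61 quoted in the abstract).
absolute_reduction = baseline_cer - proposed_cer
# The same difference expressed as a percentage of the baseline CER.
relative_reduction = absolute_reduction / baseline_cer * 100
```

Under this definition, the 14.61 figure is the absolute difference in percentage points, and the same gap corresponds to a relative CER reduction of roughly 48% with respect to the baseline.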