Enhancing Whisper Model for Pronunciation Assessment with Multi-Adapters

Jing Li, Rui Li, Shen Guo, Aishan Wumaier
DOI: https://doi.org/10.1109/apsipaasc58517.2023.10317374
2023-01-01
Abstract: Automatic pronunciation assessment is an important part of computer-aided pronunciation training. Non-native pronunciation assessment datasets are scarce, and traditional speech assessment usually relies on Goodness of Pronunciation (GOP) features, which may not provide enough information for word- or sentence-level assessment. This paper aims to improve the performance of automatic pronunciation assessment in two respects. First, to alleviate the problem of insufficient training data, we build the pronunciation assessment model on Whisper, a speech model trained with weak supervision. With the Whisper encoder, the Pearson correlation coefficient (PCC) improves significantly compared to traditional acoustic features. Second, we propose a multi-adapter method that fine-tunes the adapters with a multi-task loss, learning the phone-, word-, and sentence-level assessment tasks simultaneously to boost performance on the sentence-level task. In addition, we compare Whisper models of different scales; experimental results on the open-source speechocean762 dataset show that the proposed method achieves its best results with the medium.en model.
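The abstract names two quantities that are easy to illustrate: the PCC used as the evaluation metric, and a multi-task loss that combines phone-, word-, and sentence-level objectives. The sketch below is a minimal illustration only. The abstract does not state the form of the loss or its weights, so the per-level mean squared error, the weight values, and the function names `pearson_cc`, `mse`, and `multi_task_loss` are all assumptions introduced here, not the authors' implementation.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("PCC is undefined when either input is constant")
    return cov / math.sqrt(var_p * var_t)


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_task_loss(preds, targets, weights):
    """Weighted sum of per-level losses.

    ASSUMPTION: MSE per level and a simple weighted sum; the abstract only
    says that a multi-task loss over phone, word, and sentence tasks is used.
    """
    return sum(w * mse(preds[level], targets[level]) for level, w in weights.items())


# Hypothetical scores for one utterance, for illustration only.
preds = {"phone": [1.0, 2.0], "word": [1.0], "sentence": [2.0]}
targets = {"phone": [1.0, 1.0], "word": [2.0], "sentence": [2.0]}
weights = {"phone": 1.0, "word": 1.0, "sentence": 1.0}  # assumed equal weighting

loss = multi_task_loss(preds, targets, weights)
```

In a real training setup the per-level predictions would come from heads on top of the adapted Whisper encoder, and PCC would be computed between predicted and human-annotated scores over the test set.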