Attention-CNN Combined with Multi-Layer Feature Fusion for English L2 Multi-Granularity Pronunciation Assessment

Jianlei Yang, Aishan Wumaier, Zaokere Kadeer, Liejun Wang, Shen Guo, Jing Li
DOI: https://doi.org/10.1109/prml59573.2023.10348349
2023-01-01
Abstract: Computer-Assisted Pronunciation Training (CAPT) is a technique that helps learners correct mispronunciations in the target language and improve their oral skills. Many studies have applied multi-task learning to goodness of pronunciation (GOP) features in order to evaluate multiple dimensions of the data being assessed, but there has been no in-depth research on how to balance the relationship between the tasks or on how to make full use of the relatively scarce manually annotated assessment data. To address these problems, we make three main contributions. First, we use a neural network framework that combines Multi-Head Self-Attention and a CNN to obtain both local and global features of the input data. Second, to make full use of the input data, we propose a multi-layer feature fusion method. Finally, we employ multi-task loss weight optimization to balance the relationship between the tasks. The experimental results show improvement over the baseline model on all tasks: the Pearson correlation coefficient (PCC) for word-level accuracy reaches 58.5% (a 9.8% improvement), the PCC for the total score reaches 60.4% (a 9.3% improvement), and the PCC for phoneme completeness, a dimension that is difficult to assess, reaches 27.7% (the best baseline result was 15.5%).
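All of the results above are reported as PCC values, so a small worked example of the metric may help. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function name `pearson_cc` and the score lists are hypothetical, and it assumes that a reported figure such as 58.5% corresponds to a coefficient of 0.585 between model scores and human scores.

```python
import math


def pearson_cc(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores, for illustration only (not data from the paper).
human = [2.0, 1.5, 1.0, 2.0, 0.5]
model = [1.8, 1.6, 0.9, 1.7, 0.8]
pcc = pearson_cc(model, human)
print(f"PCC = {pcc:.3f} ({pcc * 100:.1f}%)")
```

A value near 1 means the model ranks and spaces the items much as the human raters do, while a value near 0 means there is little linear agreement.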