Pronunciation Assessment with Multi-modal Large Language Models

Kaiqi Fu, Linkai Peng, Nan Yang, Shuran Zhou
2024-07-18
Abstract: Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLMs to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring systems achieve competitive results compared to the baselines on the Speechocean762 datasets. Moreover, we also conducted an ablation study to better understand the contributions of the prompt text and training strategy in the proposed scoring system.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to use multimodal large language models (LLMs) to assess learners' pronunciation in language learning, specifically how to score sentence-level pronunciation accuracy and fluency in the shadowing scenario. Traditional pronunciation assessment methods usually rely on alignment techniques, such as automatic speech recognition (ASR) models based on the deep neural network-hidden Markov model (DNN-HMM). These methods require a large amount of labeled data and have limitations when handling non-native speakers' pronunciation. In contrast, this paper proposes an alignment-free multimodal model that aims to simplify feature extraction and improve the accuracy and efficiency of pronunciation assessment. The main contributions of the paper are:
- It proposes, for the first time, a pronunciation assessment system based on multimodal LLMs that extracts features directly from raw audio and text inputs and predicts pronunciation accuracy and fluency.
- The proposed method is alignment-free and performs competitively in experiments against both alignment-based and alignment-free systems.
- In first-stage training, the proposed multimodal system achieves state-of-the-art (SOTA) results on the ASR task, which indicates the effectiveness of the method.
Through these contributions, the paper aims to give language learners a more efficient and accurate pronunciation assessment tool and so help them improve their pronunciation.
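
The abstract's input pipeline (speech encoder output, adapter layer, concatenation with the embedded prefix and prompt) can be illustrated with a minimal shape-level sketch. This is not the authors' implementation: the helper names `linear` and `build_llm_input`, the toy dimensions, the random stand-in values, the single linear projection used as the adapter, and the prefix/speech/prompt ordering are all assumptions made for illustration.

```python
import random


def linear(frames, weight):
    """Project each frame (a list of floats) with a [d_in][d_out] weight matrix."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(frame)) for j in range(d_out)]
        for frame in frames
    ]


def build_llm_input(prefix_emb, speech_feats, prompt_emb, adapter_weight):
    """Map speech features into the text embedding width, then concatenate
    along the sequence axis. The ordering here is an assumption."""
    adapted = linear(speech_feats, adapter_weight)
    return prefix_emb + adapted + prompt_emb


rng = random.Random(0)
D_SPEECH, D_TEXT = 4, 6  # hypothetical feature and embedding widths


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


speech_feats = rand_matrix(5, D_SPEECH)         # stand-in for 5 encoder frames
adapter_weight = rand_matrix(D_SPEECH, D_TEXT)  # stand-in for a trained adapter
prefix_emb = rand_matrix(2, D_TEXT)             # stand-in for the embedded task prefix
prompt_emb = rand_matrix(3, D_TEXT)             # stand-in for the embedded prompt text

llm_input = build_llm_input(prefix_emb, speech_feats, prompt_emb, adapter_weight)
print(len(llm_input), len(llm_input[0]))  # prints "10 6"
```

The point of the sketch is only that the adapter brings the speech features to the same width as the text embeddings, so the three segments form one sequence that the LLM can consume. How the LLM then produces the accuracy and fluency scores is not covered here.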