Pronunciation Assessment with Multi-modal Large Language Models

Kaiqi Fu, Linkai Peng, Nan Yang, Shuran Zhou

2024-07-18

Abstract: Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLM to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring system achieves competitive results compared to the baselines on the Speechocean762 dataset. Moreover, we conducted an ablation study to better understand the contributions of the prompt text and training strategy in the proposed scoring system.

Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses how to use multimodal large language models (LLMs) to assess learners' pronunciation in language learning, specifically how to score sentence-level pronunciation accuracy and fluency in the shadowing scenario. Traditional pronunciation evaluation methods usually rely on alignment techniques, such as automatic speech recognition (ASR) models based on the deep neural network-hidden Markov model (DNN-HMM). These methods require a large amount of labeled data and have limitations when handling non-native speakers' pronunciation. In contrast, the paper proposes an alignment-free multimodal model, aiming to simplify feature extraction and improve the accuracy and efficiency of pronunciation evaluation.

The main contributions of the paper are:

- It proposes what the authors describe as the first pronunciation evaluation system based on multimodal large language models, which extracts features directly from raw audio and text inputs and predicts pronunciation accuracy and fluency.
- The proposed method is an alignment-free system and shows competitive performance in experiments against both alignment-based and alignment-free baselines.
- In the first training stage, the proposed multimodal system achieves state-of-the-art (SOTA) results on the ASR task, which the authors take as evidence of the method's effectiveness.

Through these contributions, the paper aims to give language learners a more efficient and accurate pronunciation evaluation tool that helps them improve their pronunciation skills.
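The data flow described in the abstract (speech encoder, then adapter, then concatenation with the embedded prefix and prompt) can be illustrated at the level of tensor shapes. The following is a minimal sketch, not the authors' implementation: the frame-stacking adapter, the concatenation order, all dimensions, and the names `adapter` and `build_llm_input` are assumptions made for illustration, and random numbers stand in for real encoder outputs and text embeddings.

```python
import random

random.seed(0)

# All sizes below are illustrative assumptions, not values from the paper.
D_SPEECH, D_LLM, STRIDE = 4, 6, 2


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or embeddings."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(frames, weight):
    """Multiply each frame (length d_in) by a weight matrix of shape [d_in][d_out]."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(frame, col)) for col in columns]
            for frame in frames]


def adapter(speech_feats, weight, stride):
    """Hypothetical modality adapter: stack `stride` adjacent frames to shorten
    the sequence, then project into the LLM embedding dimension."""
    usable = len(speech_feats) - len(speech_feats) % stride
    stacked = [sum(speech_feats[i:i + stride], [])
               for i in range(0, usable, stride)]
    return linear(stacked, weight)


def build_llm_input(prefix_emb, speech_emb, prompt_emb):
    """Concatenate along the sequence axis (this order is an assumption)."""
    return prefix_emb + speech_emb + prompt_emb


speech_feats = rand_matrix(10, D_SPEECH)       # stand-in for speech encoder output
weight = rand_matrix(D_SPEECH * STRIDE, D_LLM)  # adapter projection
prefix_emb = rand_matrix(3, D_LLM)              # embedded task-specific prefix
prompt_emb = rand_matrix(5, D_LLM)              # embedded prompt text

speech_emb = adapter(speech_feats, weight, STRIDE)
llm_input = build_llm_input(prefix_emb, speech_emb, prompt_emb)
print(len(llm_input), len(llm_input[0]))
```

In a real system the concatenated sequence would be passed to the LLM as input embeddings, and the accuracy and fluency scores would be read from its output; that step is omitted here because the source does not specify how the scores are decoded.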

Spoken Language Intelligence of Large Language Models for Language Learning

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Using Large Language Model for End-to-End Chinese ASR and NER

Prompting Large Language Models with Speech Recognition Abilities

Analyzing Large Language Models for Classroom Discussion Assessment

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

A Survey on Speech Large Language Models

LLaSM: Large Language and Speech Model

Tuning Large language model for End-to-end Speech Translation

Multi-Modal Multi-Scale Speech Expression Evaluation In Computer-Assisted Language Learning

Self-Powered LLM Modality Expansion for Large Speech-Text Models

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition