Abstract:Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open source and commercial Large Language Models (LLMs) to score a candidate's performance across various criteria in the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts that are rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family are for assessing the four sections of the CEFR B2 speaking exam, one for identifying the CEFR level of vocabulary and generating level-specific vocabulary, and another for detecting the CEFR level of text and generating level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed 3 times better than the next best model. This demonstrates that a 7B parameter LLM instruction tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to automate the assessment of CEFR B2 spoken English test scores in e - learning environments, in order to reduce the reliance on human experts. Specifically, the research objectives include: 1. **Develop an automated assessment system**: Aim to create a system that can automatically assess CEFR B2 spoken English test scores without the involvement of human assessors and is relevant in both international and India - specific contexts. 2. **Identify vocabulary and proficiency levels**: Develop an automated system that can identify vocabulary and language proficiency levels. 3. **Generate CEFR B2 - level vocabulary and sentences**: The system should also be able to generate vocabulary and sentences that meet the CEFR B2 level. To achieve these goals, the researchers took the following steps: - **Evaluate the capabilities of existing models**: First, the researchers evaluated the ability of existing open - source and commercial large - language models (LLMs) to score CEFR B2 spoken English tests under different assessment criteria, including their performance in global and India - specific contexts. - **Create a new dataset**: Next, they created a new, expert - verified, CEFR - aligned synthetic conversation dataset that contains conversation records with different assessment scores. In addition, new instruction - tuning datasets were developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR - SP WikiAuto dataset. - **Model tuning**: Using these new datasets, the researchers performed parameter - efficient instruction - tuning on Mistral Instruct 7B v0.2 and developed a series of models named EvalYaks. These models include four models for assessing different parts of the CEFR B2 spoken English test, one for identifying CEFR - level vocabulary and generating vocabulary at the corresponding level, and another for detecting CEFR - level text and generating text at the corresponding level. Through these steps, the researchers hope to demonstrate that a 7B - parameter LLM instruction - tuned with high - quality CEFR - aligned assessment data can effectively assess and score CEFR B2 spoken English tests, thereby providing a scalable automated language proficiency assessment solution.

EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts

Automated Speech Scoring System Under The Lens: Evaluating and interpreting the linguistic cues for language proficiency

Assessing Fine-Tuning Efficacy in LLMs: A Case Study with Learning Guidance Chatbots

\llinstruct: An Instruction-tuned model for English Language Proficiency Assessments

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Investigating Automatic Scoring and Feedback using Large Language Models

HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

F-Eval: Asssessing Fundamental Abilities with Refined Evaluation Methods

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

DELIA: Diversity-Enhanced Learning for Instruction Adaptation in Large Language Models

Towards automatic assessment of spontaneous spoken English

TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool

Calibrating LLM-Based Evaluator

HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

Automated speech scoring of dialogue response by Japanese learners of English as a foreign language