Abstract: People have long desired intelligent conversational systems that can provide assistance in practical scenarios. Recent advances in large language models (LLMs) are bringing this aspiration closer to reality. LLMs are believed to hold the most potential and value in education, especially in the creation of AI-driven virtual teachers that facilitate language learning. This study assesses the effectiveness of LLMs in the educational domain, specifically in spoken language learning, which encompasses phonetics, phonology, and second language acquisition. To this end, we first introduced a new multiple-choice question dataset to evaluate how well LLMs understand and apply spoken language knowledge. We then investigated the influence of various prompting techniques, including zero- and few-shot methods (prepending the question with question-answer exemplars), chain-of-thought (CoT) prompting, in-domain exemplars, and external tools. We conducted a comprehensive evaluation of 20 popular LLMs using these methods. The experimental results showed that extracting conceptual knowledge posed few challenges for these LLMs, whereas application questions were relatively difficult. In addition, several prompting methods widely shown to be effective, when combined with domain-specific examples, yielded significant performance improvements over the zero-shot baselines. Further preliminary experiments also revealed the strengths and weaknesses of different LLMs. The findings of this study can shed light on the application of LLMs to spoken language learning.

LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

Leveraging LLMs for Dialogue Quality Measurement

Which is better? Exploring Prompting Strategy For LLM-based Metrics

Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues

Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Pronunciation Assessment with Multi-modal Large Language Models

EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

A Comprehensive Analysis of the Effectiveness of Large Language Models As Automatic Dialogue Evaluators

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Exploring the Dialogue Comprehension Ability of Large Language Models

CoPrompter: User-Centric Evaluation of LLM Instruction Alignment for Improved Prompt Engineering

Self-Preference Bias in LLM-as-a-Judge

Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Large Language Model based Situational Dialogues for Second Language Learning

Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

An Investigation of Applying Large Language Models to Spoken Language Learning

Evaluating Large Language Models at Evaluating Instruction Following