Abstract: Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment because they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or to conduct fewer formative assessments. There has been longstanding interest in automated short-answer grading (ASAG), but previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short-answer questions more feasible. This paper investigates the potential for the newest generation of LLMs to be used in ASAG, specifically in the grading of short-answer questions for formative assessments, in two ways. First, it introduces a novel dataset of short-answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as LLMs are predominantly designed and trained on data from high-income North American countries. Second, the paper empirically evaluates how well various configurations of generative LLMs grade student short-answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset (QWK 0.92, F1 0.89), reaching near parity with expert human raters. To our knowledge, this work is the first to empirically evaluate the performance of generative LLMs on short-answer reading comprehension questions using real student data, with low technical hurdles to attaining this performance. These findings suggest that generative LLMs could be used to grade formative literacy assessment tasks.
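The abstract reports agreement between GPT-4 and expert human raters as quadratic weighted kappa (QWK). As an illustration only, the following is a minimal sketch of how QWK can be computed from two lists of integer grades. The function name, the 0-to-2 grading scale, and the toy grades are assumptions made for this example; they are not taken from the paper, and this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa between two lists of integer grades.

    Grades are assumed to be integers in range(num_categories).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of grades")
    if num_categories < 2:
        raise ValueError("need at least two grade categories")

    k = num_categories
    n = len(rater_a)

    # Observed confusion matrix: observed[i][j] counts answers that
    # rater A graded i and rater B graded j.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal grade counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: larger grade gaps count more heavily.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Count expected by chance if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters used a single identical grade for every answer.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical grades on a 0-2 scale (not data from the paper).
    human_grades = [0, 1, 2, 2, 1, 0]
    model_grades = [0, 1, 2, 1, 1, 0]
    print(quadratic_weighted_kappa(human_grades, model_grades, 3))
```

A value of 1.0 means the two raters agree on every answer, and a value near 0 means agreement is no better than chance. The quadratic weights penalize a two-point disagreement more heavily than a one-point disagreement, which suits ordinal grading scales.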

LLMs in Open and Closed Book Examinations in a Final Year Applied Machine Learning Course (early Findings)

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Studying LLM Performance on Closed- and Open-source Data

I don't trust you (anymore)! -- The effect of students' LLM use on Lecturer-Student-Trust in Higher Education

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Awes, Laws, and Flaws From Today's LLM Research

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Breaking the Silence: the Threats of Using LLMs in Software Engineering

Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

Insights from Social Shaping Theory: The Appropriation of Large Language Models in an Undergraduate Programming Course

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

An Empirical Study on Challenges for LLM Application Developers

"With Great Power Comes Great Responsibility!": Student and Instructor Perspectives on the influence of LLMs on Undergraduate Engineering Education

The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams

Student Perspectives on Using a Large Language Model (LLM) for an Assignment on Professional Ethics

Legal aspects of generative artificial intelligence and large language models in examinations and theses

Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education