Abstract: Automatic grading is not a new approach, but the need to adapt the latest technology to it has become very important. As technology for scoring exams and essays has rapidly become more powerful, especially from the 1990s onwards, partially or wholly automated grading systems using computational methods have evolved and become a major area of research. In particular, the demand for scoring natural language responses has created a need for tools that can grade these responses automatically. In this paper, we focus on the automatic grading of short answer questions, such as are typical in the UK GCSE system, and on providing students with useful feedback on their answers. We present experimental results on a dataset from the introductory computer science class at the University of North Texas. We first apply standard data mining techniques to the corpus of student answers to measure the similarity between each student answer and the model answer, based on the number of common words. We then evaluate the relation between these similarities and the marks awarded by scorers. We consider an approach that groups student answers into clusters, where every answer in a cluster would be awarded the same mark and given the same feedback. In this manner, we demonstrate that clusters indicate groups of students who are awarded the same or similar scores. Words in each cluster are compared to show that clusters are constructed based on how many and which words of the model answer have been used. The main novelty of this paper is a model that predicts marks based on the similarities between the student answers and the model answer. We argue that computational methods should be used to enhance the reliability of human scoring, not to replace it. Humans are required to calibrate the system and to deal with challenging situations. Computational methods can provide insight into which student answers will be found challenging and thus where human judgement is required.
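The abstract describes similarity as based on the number of words a student answer shares with the model answer. The Python sketch below illustrates one possible reading of that idea; it is not the paper's implementation. The function names, the tokenization rule, and the choice to normalize by the number of distinct model-answer words are all assumptions made for illustration, as are the example sentences.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct alphanumeric tokens.

    Assumption: the abstract does not specify a tokenization scheme.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def common_word_similarity(student_answer, model_answer):
    """Fraction of distinct model-answer words that occur in the student answer.

    Assumption: normalizing by the model answer's vocabulary size is one
    of several plausible choices; the abstract only mentions a count of
    common words.
    """
    model_words = tokenize(model_answer)
    if not model_words:
        return 0.0
    shared = model_words & tokenize(student_answer)
    return len(shared) / len(model_words)


if __name__ == "__main__":
    # Hypothetical example answers, not taken from the dataset.
    model = "A stack is a last in first out data structure"
    for answer in ["A stack is last in, first out", "It stores numbers"]:
        print(f"{common_word_similarity(answer, model):.2f}  {answer}")
```

Scores computed this way could then be compared against the marks awarded by human scorers, or used as features for grouping answers into clusters, in the spirit of the steps the abstract outlines.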

AI-assisted Automated Short Answer Grading of Handwritten University Level Mathematics Exams

Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams

Using Large Language Models to Assign Partial Credit to Students' Explanations of Problem-Solving Process: Grade at Human Level Accuracy with Grading Confidence Index and Personalized Student-facing Feedback

Performance of the pre-trained large language model GPT-4 on automated short answer grading

Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering

Grading Assistance for a Handwritten Thermodynamics Exam using Artificial Intelligence: An Exploratory Study

Automatic short answer grading and feedback using text mining methods

Towards LLM-based Autograding for Short Textual Answers

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Can AI Assistance Aid in the Grading of Handwritten Answer Sheets?

Beyond human subjectivity and error: a novel AI grading system

Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics

Automated Assessment of Multimodal Answer Sheets in the STEM domain

Performance of a Large‐Language Model in scoring construction management capstone design projects

Automatic Short Math Answer Grading via In-context Meta-learning

Generative Grading: Near Human-level Accuracy for Automated Feedback on Richly Structured Problems

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

The model student: GPT-4 performance on graduate biomedical science exams

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Grade Like a Human: Rethinking Automated Assessment with Large Language Models