Abstract: Automatic grading is not a new approach, but adapting it to the latest technology has become increasingly important. As technology for scoring exams and essays has rapidly become more powerful, especially from the 1990s onwards, partially or wholly automated grading systems using computational methods have evolved into a major area of research. In particular, the demand for scoring natural language responses has created a need for tools that can grade these responses automatically. In this paper, we focus on the automatic grading of short answer questions, such as are typical in the UK GCSE system, and on providing students with useful feedback on their answers. We present experimental results on a dataset from an introductory computer science class at the University of North Texas. We first apply standard data mining techniques to the corpus of student answers to measure the similarity between each student answer and the model answer, based on the number of words they have in common. We then evaluate the relation between these similarities and the marks awarded by scorers. We consider an approach that groups student answers into clusters, in which every answer in a cluster is awarded the same mark and given the same feedback. We demonstrate that the clusters correspond to groups of students who were awarded the same or similar scores. Comparing the words in each cluster shows that the clusters are formed according to how many, and which, words of the model answer have been used. The main novelty of this paper is a model that predicts marks from the similarities between the student answers and the model answer. We argue that computational methods should be used to enhance the reliability of human scoring, not to replace it. Humans are required to calibrate the system and to deal with challenging situations. Computational methods can provide insight into which student answers will be found challenging and thus where human judgement is required.
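The common-word similarity and the similarity-to-mark prediction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the exact similarity measure or the prediction model, so two choices here are assumptions: the shared-word count is normalised by the number of distinct words in the model answer, and marks are predicted with a simple least-squares line. The function names `common_word_similarity` and `fit_line`, and the example answers and marks, are hypothetical and not taken from the paper's dataset.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def common_word_similarity(student_answer, model_answer):
    """Fraction of the model answer's distinct words found in the student answer.

    Assumption: normalising by the model answer's vocabulary size; the paper
    only states that similarity is based on the number of common words.
    """
    model_words = tokenize(model_answer)
    if not model_words:
        return 0.0
    shared = tokenize(student_answer) & model_words
    return len(shared) / len(model_words)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    model = "A stack is a last in first out data structure"
    answers = [
        "A stack is last in first out",
        "A queue",
        "A stack is a last in first out data structure",
    ]
    marks = [4.0, 1.0, 5.0]  # hypothetical human-awarded marks

    sims = [common_word_similarity(a, model) for a in answers]
    slope, intercept = fit_line(sims, marks)

    # Predict a mark for an unseen answer from its similarity alone.
    new_sim = common_word_similarity("a stack is a data structure", model)
    print(round(slope * new_sim + intercept, 2))
```

In keeping with the abstract's argument, a prediction of this kind would serve as a check on human scoring: answers whose predicted mark differs sharply from the awarded mark are candidates for human review.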

A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation

QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

Automatic short answer grading and feedback using text mining methods

Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation

Rethinking the Evaluation of Unbiased Scene Graph Generation

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Evaluating Open-QA Evaluation

Style Over Substance: Evaluation Biases for Large Language Models

Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Modeling and Analyzing Scorer Preferences in Short-Answer Math Questions

OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

An Investigation of Evaluation Metrics for Automated Medical Note Generation