Abstract: Automatic grading is not a new approach, but adapting the latest technology to it has become increasingly important. As the technology for scoring exams and essays has rapidly become more powerful, especially from the 1990s onwards, partially or wholly automated grading systems using computational methods have evolved and become a major area of research. In particular, the demand for scoring natural language responses has created a need for tools that can grade these responses automatically. In this paper, we focus on the automatic grading of short answer questions, such as those typical of the UK GCSE system, and on providing students with useful feedback on their answers. We present experimental results on a dataset from an introductory computer science class at the University of North Texas. We first apply standard data mining techniques to the corpus of student answers to measure the similarity between each student answer and the model answer, based on the number of words they have in common. We then evaluate the relation between these similarities and the marks awarded by scorers. We consider an approach that groups student answers into clusters, where each answer in a cluster would be awarded the same mark and given the same feedback. In this manner, we demonstrate that clusters indicate groups of students who are awarded the same or similar scores. Words in each cluster are compared to show that clusters are constructed based on how many and which words of the model answer have been used. The main novelty of this paper is a model that predicts marks based on the similarities between the student answers and the model answer. We argue that computational methods should be used to enhance the reliability of human scoring, not to replace it. Humans are required to calibrate the system and to deal with challenging situations. Computational methods can provide insight into which student answers will be found challenging and thus where human judgement is required.
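
As an illustration of the similarity step described above, the following is a minimal sketch of a common-word similarity between a student answer and a model answer. The abstract states only that similarity is based on the number of common words; the tokenisation, the normalisation by the number of distinct model-answer words, the function names, and the example answers are all assumptions made here for illustration and are not taken from the paper or its dataset.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def common_word_similarity(student_answer, model_answer):
    """Fraction of the model answer's distinct words also used by the student.

    Normalising by the model answer is an assumption; the source only says
    that similarity is based on the number of common words.
    """
    model = tokens(model_answer)
    if not model:
        return 0.0
    return len(tokens(student_answer) & model) / len(model)


# Hypothetical answers, invented for this example.
model_answer = "A stack is a last in first out data structure"
student_answers = [
    "A stack is last in first out",
    "It stores data",
    "A queue is first in first out",
]

for answer in student_answers:
    score = common_word_similarity(answer, model_answer)
    print(f"{score:.2f}  {answer}")
```

Scores of this kind could then be compared with the marks awarded by human scorers, or used as the input to a clustering step, although the specific procedures used in the paper are not given in the abstract.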

Automatic short answer grading and feedback using text mining methods
