Are Large Language Models Good Essay Graders?

Anindita Kundu, Denilson Barbosa
2024-09-20
Abstract: We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.
Computation and Language, Artificial Intelligence, Computers and Society
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper evaluates how effective large language models (LLMs) are at automated essay scoring (AES), and in particular how well their scores align with human scoring. Specifically, the researchers assessed two popular LLMs, ChatGPT and Llama, on the AES task. The study focuses on:

1. **Zero-shot and few-shot learning**: The researchers explored how LLMs perform essay scoring when given no graded examples or only a few.
2. **Scoring consistency**: The researchers compared the numerical scores provided by LLMs with those of human raters to evaluate whether the two agree.
3. **Scoring explanation**: The researchers analyzed the scoring explanations provided by LLMs to understand whether these explanations are reasonable and consistent with the scores.
4. **Language error detection**: The researchers examined whether LLMs can correctly identify language errors such as spelling and grammar mistakes and reflect these errors in the scores.

### Research Background

Traditional essay scoring is primarily done by human raters, but this method faces numerous challenges in the modern educational environment, such as the demand for large-scale remote education, global teacher shortages, long scoring times, and inconsistent scoring. Automated essay scoring (AES) systems have therefore emerged to evaluate essay quality automatically, aiming to improve scoring efficiency and consistency.

### Research Questions

1. **RQ1**: Are human scores consistent with LLM scores?
2. **RQ2**: What are the possible reasons for the similarities or differences in scoring?
3. **RQ3**: Are the explanations provided by LLMs consistent with their scores?
4. **RQ4**: Can LLMs correctly identify spelling and grammar errors and reflect them in the scores?
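RQ1 comes down to comparing two lists of scores for the same essays. As a minimal sketch, assuming Pearson correlation as the agreement measure (the summary does not say which coefficient the authors used), the comparison could look like the following. The function names and the example scores are hypothetical and are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def mean_difference(human, llm):
    """Average of (LLM score - human score); negative means the LLM is harsher."""
    return sum(l - h for h, l in zip(human, llm)) / len(human)


# Made-up scores for illustration only, not taken from the ASAP dataset.
human_scores = [8, 9, 7, 10, 6]
llm_scores = [6, 7, 7, 8, 5]

print("correlation:", round(pearson(human_scores, llm_scores), 3))
print("mean difference:", round(mean_difference(human_scores, llm_scores), 2))
```

A negative mean difference corresponds to the abstract's observation that the LLMs tend to assign lower scores than human raters. The correlation captures a separate question: whether the two rank essays similarly.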
### Methods

The researchers used the Automated Student Assessment Prize (ASAP) dataset, which contains approximately 13,000 essays written by students in grades 7 to 10. The dataset is divided into eight tasks, each corresponding to different prompts and scoring ranges. The researchers selected tasks 1 and 7 for the experiments, which include holistic scoring and trait scoring, respectively.

### Experimental Setup

1. **Prompt design and response generation**: The researchers input the students' essays, scoring criteria, and scoring ranges into ChatGPT and Llama, asking the LLMs to provide a numerical score and a scoring explanation.
2. **Feature extraction**: Various features were extracted from the essays, including the number of sentences, vocabulary, readability, use of transition words, and language errors, to further analyze the differences between LLMs and human raters in the scoring process.

### Conclusion

The study found that although LLMs performed well in certain aspects, such as correctly identifying language errors, overall, the correlation between LLM scores and human scores was weak. Additionally, LLMs showed some deficiencies in providing scoring explanations. However, the results still offer a positive outlook for the future use of LLMs to assist human scoring.
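For readers who want a concrete picture of the feature-extraction step in the experimental setup, here is a rough sketch of a few surface features. It is an illustration under stated assumptions, not the authors' pipeline: the transition-word list, the sentence-splitting rule, and the function name `extract_features` are all hypothetical. Readability metrics and spelling or grammar checks are left out because they need external tools.

```python
import re

# Hypothetical, deliberately tiny list; the paper's actual list is not given here.
TRANSITION_WORDS = {"however", "therefore", "moreover", "furthermore", "finally"}


def extract_features(essay):
    """Return simple length, vocabulary, and transition-word counts for an essay."""
    # Naive sentence split on ., ! and ?; empty fragments are discarded.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", essay.lower())
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "vocabulary_size": len(set(words)),
        "num_transition_words": sum(w in TRANSITION_WORDS for w in words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


# Made-up text for illustration; not an essay from the ASAP dataset.
sample = (
    "I like libraries. However, some books are offensive! "
    "Therefore, libraries should choose books carefully."
)
features = extract_features(sample)
print(features)
```

Each feature can then be correlated with human and LLM scores separately, which is the kind of comparison the abstract reports when it says none of these features correlates strongly with either.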