Abstract: We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper aims to evaluate the effectiveness of large language models (LLMs) in the task of automated essay scoring (AES), particularly their consistency with human scoring. Specifically, the researchers assessed the performance of two popular LLMs, ChatGPT and Llama, in the AES task. The focus of the study includes:

1. **Zero-shot and few-shot learning**: The researchers explored how LLMs perform essay scoring with no or only a small amount of training data.
2. **Scoring consistency**: The researchers compared the numerical scores provided by LLMs with those of human raters to evaluate whether the LLMs' scores are consistent with human scores.
3. **Scoring explanation**: The researchers analyzed the scoring explanations provided by LLMs to understand whether these explanations are reasonable and consistent with the scores.
4. **Language error detection**: The researchers examined whether LLMs can correctly identify and evaluate language errors such as spelling and grammar mistakes and reflect these errors in the scores.

### Research Background

Traditional essay scoring is primarily done by human raters, but this method faces numerous challenges in the modern educational environment, such as the demand for large-scale remote education, global teacher shortages, long scoring times, and inconsistent scoring. Therefore, automated essay scoring (AES) systems have emerged to evaluate essay quality through automated methods, aiming to improve scoring efficiency and consistency.

### Research Questions

1. **RQ1**: Are human scores consistent with LLM scores?
2. **RQ2**: What are the possible reasons for the similarities or differences in scoring?
3. **RQ3**: Are the explanations provided by LLMs consistent with their scores?
4. **RQ4**: Can LLMs correctly identify spelling and grammar errors and reflect them in the scores?
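
To make RQ1 concrete, one way to quantify agreement is a Pearson correlation between human and LLM scores, together with the average signed gap between them (a negative gap means the LLM grades more harshly). The sketch below is only an illustration under stated assumptions: the function names `pearson` and `mean_gap` and the toy scores are hypothetical, and the summary does not state which correlation statistic the paper itself reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def mean_gap(human, llm):
    """Average of (LLM score - human score); negative means the LLM is harsher."""
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty lists of equal length")
    return sum(l - h for h, l in zip(human, llm)) / len(human)


# Invented toy scores, for illustration only (not taken from the paper).
human_scores = [8, 9, 10, 7, 6]
llm_scores = [6, 7, 8, 7, 5]
print("correlation:", round(pearson(human_scores, llm_scores), 3))
print("mean gap:", mean_gap(human_scores, llm_scores))
```

Reporting both numbers separates two effects the abstract describes: scores can be systematically lower (negative gap) and, independently, poorly correlated with human scores.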
### Methods

The researchers used the Automated Student Assessment Prize (ASAP) dataset, which contains approximately 13,000 essays written by students in grades 7 to 10. The dataset is divided into eight tasks, each corresponding to different prompts and scoring ranges. The researchers selected tasks 1 and 7 for the experiments, which include holistic scoring and trait scoring, respectively.

### Experimental Setup

1. **Prompt design and response generation**: The researchers input the students' essays, scoring criteria, and scoring ranges into ChatGPT and Llama, asking the LLMs to provide a numerical score and a scoring explanation.
2. **Feature extraction**: Various features were extracted from the essays, including the number of sentences, vocabulary, readability, use of transition words, and language errors, to further analyze the differences between LLMs and human raters in the scoring process.

### Conclusion

The study found that although LLMs performed well in certain aspects, such as correctly identifying language errors, overall, the correlation between LLM scores and human scores was weak. Additionally, LLMs showed some deficiencies in providing scoring explanations. However, the results still offer a positive outlook for the future use of LLMs to assist human scoring.
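
The feature-extraction step in the experimental setup can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `extract_features` function, the `TRANSITIONS` word list, and the sentence-splitting rule are all assumptions made for this example, and readability metrics and spelling or grammar error counts are left out because they normally require external tools.

```python
import re

# Hypothetical, deliberately tiny list of transition words (an assumption).
TRANSITIONS = {"however", "therefore", "moreover", "furthermore", "consequently"}


def extract_features(essay):
    """Return simple surface features of an essay as a dict of counts."""
    # Naive sentence split on runs of ., ! or ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", essay.lower())
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "vocabulary_size": len(set(words)),
        "num_transitions": sum(1 for w in words if w in TRANSITIONS),
    }


features = extract_features("I like dogs. However, I like cats more!")
print(features)
```

Each feature column could then be correlated with the human and LLM scores to check, as the abstract reports, whether any single feature tracks either set of scores strongly.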

Are Large Language Models Good Essay Graders?

Can Large Language Models Automatically Score Proficiency of Written Essays?

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

Using Large Language Models for Automated Grading of Student Writing about Science

Performance of a Large‐Language Model in scoring construction management capstone design projects

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Can Large Language Models Be an Alternative to Human Evaluations?

Large Language Models as Partners in Student Essay Evaluation

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

A Closer Look into Using Large Language Models for Automatic Evaluation

Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Analyzing Large Language Models for Classroom Discussion Assessment

Are Large Language Models Reliable Argument Quality Annotators?

Large Language Models As MOOCs Graders

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition