Abstract: Recent advancements in natural language processing, computational linguistics, and artificial intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving increases of 112% and 114% in scoring accuracy, respectively, under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs: lower temperatures produce scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. For multi-dimensional writing assessment, GPT-4 performs well on the Ideas (quadratic weighted kappa, QWK = 0.551) and Organization (QWK = 0.584) dimensions when prompts are well crafted. These findings pave the way for a comprehensive exploration of the broader educational implications of LLMs, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in an AI-powered educational era. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
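
The agreement figures in the abstract are reported as QWK (quadratic weighted kappa), a chance-corrected agreement statistic that penalizes large score disagreements more heavily than small ones. The following is a minimal sketch of how such a value could be computed for one rating dimension; it is not the authors' evaluation code. The function name, the integer score range, and the convention of returning 1.0 when both raters give every essay the same single score are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores.

    Illustrative helper (hypothetical name); scores are assumed to be integers
    in the closed range [min_score, max_score].
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    levels = max_score - min_score + 1
    if levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for the human rater
    model_hist = Counter(model)            # marginal counts for the LLM rater

    weighted_observed = 0.0
    weighted_expected = 0.0
    for a in range(min_score, max_score + 1):
        for b in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            weight = (a - b) ** 2 / (levels - 1) ** 2
            weighted_observed += weight * observed[(a, b)] / n
            weighted_expected += weight * human_hist[a] * model_hist[b] / (n * n)

    if weighted_expected == 0.0:
        # Assumed convention: both raters used one identical score throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up scores on a 1-4 scale, purely for demonstration.
    human_scores = [1, 2, 3, 4, 3, 2]
    model_scores = [1, 2, 3, 3, 3, 1]
    print(quadratic_weighted_kappa(human_scores, model_scores, 1, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. In a multi-dimensional setting the statistic would be computed separately for each dimension, such as Ideas and Organization.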

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Human-Centered Design Recommendations for LLM-as-a-Judge

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?

HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?

Calibrating LLM-Based Evaluator

Can Large Language Models Be an Alternative to Human Evaluations?

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Style Over Substance: Evaluation Biases for Large Language Models

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates