Abstract: Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy, respectively, under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs: lower temperatures produce scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. For multi-dimensional writing assessment, GPT-4 performs well on the Ideas (QWK=0.551) and Organization (QWK=0.584) dimensions when given well-crafted prompts. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the age of AI-powered education. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
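The QWK values reported in the abstract refer to quadratic weighted kappa, a standard agreement statistic for ordinal scores that penalizes disagreements by the squared distance between the two ratings. The abstract does not say how the authors computed it, so the following is only a minimal sketch of the textbook definition; the function name, its arguments, and the sample scores are illustrative assumptions, not taken from the study.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Textbook quadratic weighted kappa between two lists of integer scores.

    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("need at least two score categories")
    n = len(human)

    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            expected = hist_h[i] * hist_m[j] / n
            num += weight * observed[i][j]
            den += weight * expected

    # If both raters used a single identical category, treat as full agreement.
    if den == 0:
        return 1.0
    return 1.0 - num / den


# Made-up scores on a 1-4 scale, purely for illustration.
human_scores = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human_scores, [1, 2, 3, 4], 1, 4))  # identical ratings
print(quadratic_weighted_kappa(human_scores, [4, 3, 2, 1], 1, 4))  # fully reversed ratings
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.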

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

Can Large Language Models Automatically Score Proficiency of Written Essays?

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Enhancing LLM-Based Feedback: Insights from Intelligent Tutoring Systems and the Learning Sciences

Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs

A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring

"My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays

Automated Essay Scoring Via Example-Based Learning

AESPrompt: Self-supervised Constraints for Automated Essay Scoring with Prompt Tuning

Which is better? Exploring Prompting Strategy For LLM-based Metrics

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Beyond Scores: A Modular RAG-Based System for Automatic Short Answer Scoring with Feedback

Prompting LLMs to Compose Meta-Review Drafts from Peer-Review Narratives of Scholarly Manuscripts

Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring