Abstract: Automated essay scoring (AES) is a compelling topic in Learning Analytics, primarily because recent advances in AI find it a good testbed for exploring the artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few studies have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies in which models were trained to predict both the holistic and rubric scores of essays using the Automated Student Assessment Prize's essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. The effect of deep learning was found to be most visible when assessing the trustworthiness of explanation models: as more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study also shows that faster SHAP implementations (by up to three orders of magnitude) are as accurate as the slower model-agnostic one. It leverages the state of the art in natural language processing, applying feature selection to a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity.
In addition to the list of the most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that makes it possible to quantify the impact of implementing formative feedback.
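To make the idea of local (per-essay) feature importance concrete, the following is a minimal sketch of exact Shapley value attribution, the quantity that SHAP implementations approximate. It is not the study's code: the scoring function `toy_score`, its three feature names, the coefficients, and the baseline values are all hypothetical, chosen only for illustration. Exact enumeration is feasible here because there are three features; it would not scale to a pool of hundreds of linguistic indices.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(subset):
        # Coalition features come from x; all others from the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    # Hypothetical rubric scorer with one interaction term (assumption).
    cohesion, diversity, sophistication = z
    return 1.0 + 2.0 * cohesion + 0.5 * diversity + 1.5 * cohesion * sophistication


essay = [0.8, 0.4, 0.6]      # hypothetical feature values for one essay
baseline = [0.5, 0.5, 0.5]   # hypothetical reference (e.g. average) values

phi = shapley_values(toy_score, essay, baseline)
for name, contribution in zip(["cohesion", "diversity", "sophistication"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the essay's predicted score and the baseline prediction, which is the additivity property that lets a per-essay explanation be read as a breakdown of the score.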

Can GPT-4 do L2 analytic assessment?

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Performance of a Large-Language Model in scoring construction management capstone design projects

Can Large Language Models Automatically Score Proficiency of Written Essays?

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Performance of the pre-trained large language model GPT-4 on automated short answer grading

Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams

Using ChatGPT to Score Essays and Short-Form Constructed Responses

Automated assessment of non-native learner essays: Investigating the role of linguistic features

Are Large Language Models Good Essay Graders?

AI-assisted Automated Short Answer Grading of Handwritten University Level Mathematics Exams

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

Can GPT-4 learn to analyse moves in research article abstracts?

GPT is an effective tool for multilingual psychological text analysis

Is GPT-4 a Good Data Analyst?