Abstract: While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the gap between large language models (LLMs) and human graders in automatic grading, a gap that is most visible on complex questions. Existing research mainly focuses on one step of the grading process, namely grading against predefined grading criteria, and neglects other crucial steps such as grading criteria design and post-grading review. This leads to the following challenges:

1. **Rubric Generation**:
   - Existing systems require educators to design detailed grading criteria for each question by hand, which is time-consuming and labor-intensive.
   - Even a minor change in the grading criteria may lead to significantly different grading results, and educators cannot predict which criteria will be most effective.
   - Criteria design and grading are usually carried out independently, so the criteria cannot be adjusted in light of the answers students actually give.
2. **Consistency & Fairness**:
   - The same answer should receive the same score across multiple independent gradings, and similar answers should receive similar scores.
   - Because of the inherent randomness and hallucinations of LLMs, the model may produce completely different grades for similar inputs.
   - Existing automatic grading systems evaluate each answer independently and have no means of assessing, or improving, the fairness of the grading process as a whole.
3. **Handling of Complex Problems**:
   - Existing automatic grading systems struggle with complex and open-ended questions. For example, some open-ended questions have no standard answer, complex questions may need to be graded in multiple sub-steps, and long questions and answers may exceed the context length of LLMs.
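The consistency requirement in item 2 can be made concrete with a small measurement. The following is a minimal sketch, not the paper's evaluation procedure: `score_spread` and `make_noisy_grader` are hypothetical names, and the noisy grader is a toy stand-in for an LLM call, used only to show how repeated grading of one answer exposes run-to-run variation.

```python
import random
from typing import Callable


def score_spread(grade: Callable[[str], float], answer: str, runs: int = 5) -> float:
    """Grade the same answer `runs` times and return max score minus min score."""
    scores = [grade(answer) for _ in range(runs)]
    return max(scores) - min(scores)


def make_noisy_grader(seed: int, noise: float) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM grader: a toy score plus random noise."""
    rng = random.Random(seed)

    def grade(answer: str) -> float:
        # Toy heuristic (word count capped at 10); NOT a real rubric.
        base = min(len(answer.split()), 10)
        return base + rng.uniform(-noise, noise)

    return grade
```

A spread of zero means the grader is perfectly repeatable on that answer; a large spread signals the kind of inconsistency described above. In practice `grade` would wrap a real model call, which is an assumption outside this sketch.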
To address these challenges, the paper proposes a multi-agent grading system named "Grade-Like-a-Human", which divides the grading process into three stages: grading criteria generation, grading, and post-grading review. With this design, the paper aims to achieve more accurate, consistent, and fair automatic grading.

### Main Contributions

- **Improvement in System Perspective**: Points out that existing automatic grading systems lack a systematic view of the grading task, and in particular face challenges in grading criteria generation, in consistency and fairness, and in handling complex problems.
- **Multi-Agent Grading Framework**: Proposes, for the first time, a multi-agent grading framework that can plan, reflect, and adjust at multiple stages of the grading task.
- **New Dataset**: Collects and open-sources a dataset named OS for evaluating the performance of LLMs on grading tasks. The dataset comes from an operating system course in a computer science program and includes all questions and student answers from tutorials and assignments, along with human-assigned grades.

Through these contributions, the paper shows how LLMs can be used to improve the accuracy, reliability, and fairness of automatic grading.
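The three-stage split described above (criteria generation, grading, post-grading review) can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the paper's implementation: the `LLM` callable type, the prompt wording, and every function name here are hypothetical, and the review stage uses a deliberately simple check (identical answers that received different scores) in place of whatever review logic the paper's agents apply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical text-in, text-out model interface


@dataclass
class Graded:
    answer: str
    score: float


def generate_rubric(llm: LLM, question: str, sample_answers: List[str]) -> str:
    """Stage 1: draft a rubric from the question and a sample of student answers."""
    prompt = f"Question: {question}\nSample answers:\n" + "\n".join(sample_answers)
    return llm("Write a grading rubric.\n" + prompt)


def grade_answers(llm: LLM, rubric: str, answers: List[str]) -> List[Graded]:
    """Stage 2: score each answer under the shared rubric."""
    return [
        Graded(a, float(llm(f"Rubric: {rubric}\nAnswer: {a}\nScore:")))
        for a in answers
    ]


def review(graded: List[Graded]) -> List[str]:
    """Stage 3: flag normalized answers whose identical text got different scores."""
    seen: Dict[str, float] = {}
    flagged: List[str] = []
    for g in graded:
        key = g.answer.strip().lower()
        if key in seen and seen[key] != g.score and key not in flagged:
            flagged.append(key)
        seen.setdefault(key, g.score)  # remember the first score seen for this text
    return flagged
```

Passing the sample answers into `generate_rubric` mirrors the summary's point that criteria should reflect what students actually wrote. Flagged answers from `review` would be candidates for regrading, which this sketch leaves out.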

Grade Like a Human: Rethinking Automated Assessment with Large Language Models
