Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan
2024-05-30
Abstract: While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.
Artificial Intelligence
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper addresses the fact that large language models (LLMs) have not yet reached human-level performance in automated grading, especially on complex questions. Existing research focuses mainly on one step of the grading process, namely grading against predefined rubrics, and neglects other crucial steps such as rubric design and post-grading review. This leaves the following challenges:

1. **Rubric Generation**:
   - Existing systems require educators to design detailed grading rubrics for each question by hand, which is time-consuming and labor-intensive.
   - Even a minor change to a rubric can lead to significantly different grading results, and educators cannot predict which rubric will be most effective.
   - Rubric design and grading are usually carried out independently, so the rubric cannot be adjusted in light of students' actual responses.
2. **Consistency & Fairness**:
   - The same answer should receive the same score across multiple independent grading runs, and similar answers should receive similar scores.
   - Because of the inherent randomness and hallucinations of LLMs, a model may produce very different grades for similar inputs.
   - Existing automated grading systems evaluate each answer independently and have no means of assessing, or improving, the fairness of the grading process as a whole.
3. **Handling of Complex Problems**:
   - Existing automated grading systems struggle with complex and open-ended questions. For example, some open-ended questions have no standard answer, complex questions may need to be graded in multiple sub-steps, and long questions and answers may exceed the context length of LLMs.

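
The consistency requirement above (the same answer should receive the same score across independent grading runs) can be checked with a few lines of code. The following is a minimal sketch, not the paper's method: the function name `consistency_report`, the `tolerance` threshold, and the example scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def consistency_report(repeated_scores, tolerance=0.5):
    """Summarize how stable the scores of each answer are across runs.

    repeated_scores maps an answer id to the scores it received in several
    independent grading runs. An answer counts as consistent when the gap
    between its highest and lowest score is at most `tolerance`
    (an assumed threshold, not a value from the paper).
    """
    report = {}
    for answer_id, scores in repeated_scores.items():
        spread = max(scores) - min(scores)
        report[answer_id] = {
            "mean": mean(scores),
            "std": pstdev(scores),
            "spread": spread,
            "consistent": spread <= tolerance,
        }
    return report


# Illustrative scores only: three grading runs per student answer.
runs = {
    "student_01": [4.0, 4.0, 4.0],
    "student_02": [2.0, 3.5, 5.0],
}
report = consistency_report(runs)
for answer_id, stats in report.items():
    print(answer_id, stats)
```

Answers flagged as inconsistent are natural candidates for a second look, whether by a human or by a later review step.
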
To address these challenges, the paper proposes a multi-agent grading system named "Grade-Like-a-Human", which divides the grading process into three stages: rubric generation, grading, and post-grading review. With this design, the paper aims to make automated grading more accurate, consistent, and fair.

### Main Contributions

- **System-Level Perspective**: The paper points out that existing automated grading systems lack a systematic view of the grading task, and in particular face challenges in rubric generation, in grading consistency and fairness, and in handling complex problems.
- **Multi-Agent Grading Framework**: It proposes what the authors describe as the first multi-agent grading framework, one that can plan, reflect, and adjust at multiple stages of the grading task.
- **New Dataset**: It collects and open-sources a dataset named OS for evaluating LLMs on grading tasks. The dataset comes from an operating systems course for computer science majors and includes all questions and student answers from tutorials and assignments, along with human-assigned grades.

With these contributions, the paper shows how LLMs can be used to improve the accuracy, reliability, and fairness of automated grading.
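
The three-stage split (rubric generation, grading, post-grading review) can be pictured as a small pipeline. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompts, the function names, and the toy model are all hypothetical, and the review step here only flags identical answers that received different scores, which is one simple fairness check suggested by the summary above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]


@dataclass
class Graded:
    student: str
    answer: str
    score: float


def generate_rubric(llm: LLM, question: str, sample_answers: List[str]) -> str:
    """Stage 1: draft a rubric from the question and sampled student answers."""
    prompt = (
        f"Question: {question}\nSample answers:\n"
        + "\n".join(sample_answers)
        + "\nWrite a grading rubric."
    )
    return llm(prompt)


def grade(llm: LLM, question: str, rubric: str, answers: Dict[str, str]) -> List[Graded]:
    """Stage 2: score every answer under the same rubric."""
    results = []
    for student, answer in answers.items():
        reply = llm(f"Rubric: {rubric}\nQuestion: {question}\nAnswer: {answer}\nScore:")
        results.append(Graded(student, answer, float(reply)))
    return results


def review(graded: List[Graded]) -> List[str]:
    """Stage 3: flag students whose identical answers got different scores."""
    by_answer: Dict[str, List[Graded]] = {}
    for g in graded:
        by_answer.setdefault(g.answer.strip().lower(), []).append(g)
    flagged = []
    for group in by_answer.values():
        if len({g.score for g in group}) > 1:
            flagged.extend(g.student for g in group)
    return flagged


def make_toy_llm() -> LLM:
    """Stand-in for a real model: fixed rubric, deliberately unstable scores."""
    scores = iter(["5", "3", "4"])

    def toy_llm(prompt: str) -> str:
        if prompt.endswith("Write a grading rubric."):
            return "Award 5 points for mentioning mutual exclusion."
        return next(scores)

    return toy_llm


llm = make_toy_llm()
question = "What does a mutex provide?"
answers = {
    "alice": "Mutual exclusion.",
    "bob": "mutual exclusion.",
    "carol": "Faster disk I/O.",
}
rubric = generate_rubric(llm, question, list(answers.values()))
graded = grade(llm, question, rubric, answers)
flagged = review(graded)
print(flagged)
```

Keeping each stage behind its own function means a real model client can replace the toy one without touching the review logic, and flagged answers can be sent back for regrading.
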