Abstract: Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify the appropriate instructions by prompting ChatGPT with different templates to collect the rationales, where inconsistent rationales are refined to align with marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall QWK score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution to achieve explainable automated assessment in education. Code available at https://github.com/lijiazheng99/aera.
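The QWK figure in the abstract refers to quadratic weighted kappa, a standard agreement measure between two sets of ordinal scores. As a reading aid, the following is a minimal sketch of the textbook formula in plain Python. It is not taken from the paper's code; the function name and arguments are chosen here for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores.

    Illustrative helper (name and signature are assumptions, not from the paper).
    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_labels = max_score - min_score + 1
    if n_labels < 2:
        raise ValueError("need at least two distinct score levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) scores
    human_hist = Counter(human)            # marginal counts for the human scores
    model_hist = Counter(model)            # marginal counts for the model scores

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            expected = human_hist[i] * model_hist[j] / n
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0:
        # Both lists contain one identical constant score: treat as full agreement.
        return 1.0
    return 1.0 - numerator / denominator
```

Because the weights grow with the squared distance between the two scores, a prediction that is off by two points is penalized four times as heavily as one that is off by a single point.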

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the problems of interpretability and credibility in automated student answer assessment. Specifically, it proposes a new framework, AERA (Automated Explainable Student Response Assessment), which uses the large language model ChatGPT to generate rationales and transfers this capability to a smaller language model through distillation, yielding efficient and interpretable student answer assessment.

### Background and challenges

1. **Limitations of manual assessment**:
   - Manually providing detailed feedback is time-consuming and labor-intensive.
   - Differences in scoring criteria among assessors may lead to inconsistent scoring.
2. **Limitations of existing automated assessment models**:
   - Most existing automated student answer assessment models are based on pre-trained language models (PLMs). Although they improve assessment efficiency and consistency, they lack transparency.
   - Black-box classification models are difficult to interpret, which lowers the credibility of their assessment results.
   - Training a model to generate rationales requires a large amount of labeled data, which is costly and difficult to obtain in practice.

### Solutions

1. **Using ChatGPT to generate rationales**:
   - Design different prompt templates to guide ChatGPT to generate scores and rationales for student answers.
   - Through multiple rounds of experiments, select the most appropriate prompt template to improve the quality of the generated rationales.
2. **Rationale and data refinement**:
   - Introduce a rationale refinement module to improve the quality and usability of the generated rationales.
   - Use methods such as semantic confidence intervals to identify and correct mislabeled data and reduce data uncertainty.
3. **Distilling into a smaller language model**:
   - Use the rationales generated by ChatGPT as training data to fine-tune a smaller language model (such as Long T5) for efficient, interpretable assessment.
   - Experiments show that this method performs well on multiple datasets, especially in generating high-quality rationales.

### Main contributions

1. **The AERA framework**: Distills ChatGPT's rationale-generating capability into a smaller language model to achieve efficient and interpretable student answer assessment.
2. **Two refinement strategies**: Independently refine the rationales generated by ChatGPT to improve their quality and credibility.
3. **Comprehensive experiments and human evaluation**: Show that the method can generate high-quality rationales without additional labeling and significantly improve assessment performance.

### Conclusion

The AERA framework addresses the problems of interpretability and credibility in automated student answer assessment by leveraging the capabilities of large language models and combining data refinement with model distillation, providing a new solution for automated assessment in education.
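The distillation step can be pictured as turning each refined ChatGPT output into a text-to-text training pair, so that the smaller model learns to emit a score and a rationale in one string. The sketch below illustrates that idea under stated assumptions: the field names, the prompt layout, and the "score; rationale" target format are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class RefinedExample:
    """One refined training item (all field names are hypothetical)."""
    question: str
    rubric: str
    student_answer: str
    score: int
    rationale: str


def to_seq2seq_pair(example):
    """Build a (source, target) text pair for fine-tuning a seq2seq model.

    The layout below is an assumed template, not the paper's actual prompt.
    """
    source = (
        f"Question: {example.question}\n"
        f"Marking rubric: {example.rubric}\n"
        f"Student answer: {example.student_answer}\n"
        "Assess the answer:"
    )
    target = f"{example.score} points; {example.rationale}"
    return source, target


def parse_output(text):
    """Split a generated string back into (score, rationale).

    Returns None when the text does not follow the assumed target format.
    """
    head, separator, rationale = text.partition(";")
    tokens = head.split()
    if not separator or not tokens or not tokens[0].isdigit():
        return None
    return int(tokens[0]), rationale.strip()


example = RefinedExample(
    question="Describe two steps of protein synthesis.",
    rubric="1 point per correct step, up to 3 points.",
    student_answer="mRNA leaves the nucleus and tRNA brings amino acids.",
    score=2,
    rationale="The answer names mRNA export and tRNA delivery.",
)
source_text, target_text = to_seq2seq_pair(example)
```

Keeping the score at the start of the target string makes it straightforward to recover the numeric prediction for metric computation, while the remainder of the string serves as the rationale shown to the reader.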

Distilling ChatGPT for Explainable Automated Student Answer Assessment

ChatGPT's Capabilities in Providing Feedback on Undergraduate Students’ Argumentation: A Case Study

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment

Using ChatGPT to Score Essays and Short-Form Constructed Responses

Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT

How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses

Exploring the Efficacy of ChatGPT in Analyzing Student Teamwork Feedback with an Existing Taxonomy

Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring

Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education

Fine-tuning ChatGPT for Automatic Scoring

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

Enhancing Multi-Domain Automatic Short Answer Grading through an Explainable Neuro-Symbolic Pipeline

Applying large language models and chain-of-thought for automatic scoring

Can ChatGPT Effectively Complement Teacher Assessment of Undergraduate Students’ Academic Writing?

Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family

Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Can ChatGPT generate practice question explanations for medical students, a new faculty teaching tool?

Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model

ChatGPT for Next Generation Science Learning