Distilling ChatGPT for Explainable Automated Student Answer Assessment

Jiazheng Li, Lin Gui, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He
2023-10-24
Abstract: Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify the appropriate instructions by prompting ChatGPT with different templates to collect the rationales, where inconsistent rationales are refined to align with marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall QWK score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution to achieve explainable automated assessment in education. Code available at https://github.com/lijiazheng99/aera.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the lack of interpretability and credibility in automated student answer assessment. It proposes a new framework, AERA (Automated Explainable Student Response Assessment), which uses the large language model ChatGPT to generate rationales for scores and then transfers this capability to a smaller language model through distillation, so that student answers can be assessed efficiently and interpretably.

### Background and challenges

1. **Limitations of manual assessment**:
   - Manually providing detailed feedback is time-consuming and labor-intensive.
   - Differences in scoring criteria among assessors may lead to inconsistent scoring.
2. **Limitations of existing automated assessment models**:
   - Most existing automated student answer assessment models are based on pre-trained language models (PLMs). Although they improve assessment efficiency and consistency, they lack transparency.
   - Black-box models in classification tasks are difficult to interpret, which lowers the credibility of assessment results.
   - Training a model to generate rationales requires a large amount of labeled data, which is costly and difficult to obtain in practice.

### Solutions

1. **Using ChatGPT to generate rationales**:
   - Design different prompt templates to guide ChatGPT to generate scores and rationales for student answers.
   - Through multiple rounds of experiments, select the most appropriate prompt template to improve the quality of the generated rationales.
2. **Rationale and data refinement**:
   - Introduce a rationale refinement module to improve the quality and usability of the generated rationales.
   - Use methods such as semantic confidence intervals to identify and correct mislabeled data, reducing data uncertainty.
3. **Distilling into a smaller language model**:
   - Use the rationales generated by ChatGPT as training data to fine-tune a smaller language model (such as Long T5) for efficient, interpretable assessment.
   - Experiments indicate that this method performs well on multiple datasets, especially in generating high-quality rationales.

### Main contributions

1. **Proposed the AERA framework**: Distills ChatGPT's rationale-generating capability into a smaller language model to achieve efficient and interpretable student answer assessment.
2. **Introduced two strategies**: These independently refine the rationales generated by ChatGPT and improve their quality and credibility.
3. **Conducted comprehensive experiments and human evaluations**: These show that the method can generate high-quality rationales without additional labeling and significantly improve assessment performance.

### Conclusion

The AERA framework addresses the problems of interpretability and credibility in automated student answer assessment by leveraging the capabilities of large language models and combining data refinement with model distillation, providing a new solution for automated assessment in education.
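
The abstract reports scoring performance as QWK (quadratic weighted kappa), an agreement measure between predicted and gold scores that penalizes larger score gaps more heavily. As a rough illustration of the metric only, here is a minimal plain-Python sketch. The function name, its signature, and the convention of returning 1.0 when expected disagreement is zero are assumptions made for this example, not code taken from the AERA repository.

```python
from collections import Counter


def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    Illustrative helper (hypothetical name); scores are assumed to lie
    in the closed range [min_score, max_score].
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n_labels = max_score - min_score + 1
    if n_labels < 2:
        raise ValueError("need at least two distinct score levels")

    n = len(gold)
    observed = Counter(zip(gold, pred))  # joint counts of (gold, pred) pairs
    gold_hist = Counter(gold)            # marginal counts of gold scores
    pred_hist = Counter(pred)            # marginal counts of predicted scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at the maximum gap.
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if gold and pred were statistically independent.
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n

    if weighted_expected == 0:
        # Both raters gave one identical constant score; this sketch treats
        # that as perfect agreement (a convention chosen here).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    gold_scores = [0, 1, 2, 3, 2, 1]
    model_scores = [0, 1, 2, 2, 2, 0]
    print(quadratic_weighted_kappa(gold_scores, model_scores, 0, 3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.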