Automated Assessment of Students' Code Comprehension using LLMs

Priti Oli, Rabin Banjade, Jeevan Chapagain, Vasile Rus
2023-12-20
Abstract: Assessing students' answers, and in particular natural language answers, is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as Large Language Models (LLMs), have led to significant progress in various natural language tasks. Nevertheless, amidst the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs in the realm of automated answer assessment has not received much attention. To address this gap, we explore the potential of using LLMs for automated assessment of students' short and open-ended answers. In particular, we use LLMs to compare students' explanations with expert explanations in the context of line-by-line explanations of computer programs.
Computers and Society, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the automated assessment of students' natural-language explanations of code. Specifically, the authors explore using large language models (LLMs) to assess students' short, open-ended answers, in particular self-explanations in the programming domain. The approach compares students' line-by-line explanations of computer programs with expert explanations and uses their similarity to determine whether a student's understanding is correct.

### Background and Motivation

Assessing students' answers, especially natural-language answers, is an important challenge in education, and in computer science education in particular. In recent years, advances in machine learning, especially Transformer-based models such as LLMs, have led to significant progress on various natural-language tasks. However, relatively little research has examined how to use these models for automated answer assessment. The paper aims to fill this gap by exploring the potential of LLMs for automatically assessing students' code comprehension.

### Main Contributions

1. **Explore the application of LLMs in automated assessment**: The authors study how well LLMs assess students' natural-language explanations of code and compare them with traditional encoder models.
2. **Propose multiple assessment strategies**: LLM performance is evaluated under different prompting strategies (zero-shot, few-shot, and chain-of-thought prompts).
3. **Dataset and experimental design**: The study uses a dataset of students' and experts' explanations of code snippets and runs detailed experiments and analyses across multiple models and methods.

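The three prompting strategies named in the contributions can be illustrated with a small prompt builder. This is only a sketch: the instruction wording, the `build_prompt` function, and the example format are hypothetical and are not the prompts used in the paper. It shows how zero-shot, few-shot, and chain-of-thought prompts differ structurally while sharing the same scoring task.

```python
from typing import Sequence, Tuple

# (student explanation, expert explanation, similarity score)
Example = Tuple[str, str, float]

# Hypothetical instruction text; the paper's actual prompts are not reproduced here.
TASK = (
    "Rate how semantically similar the student's explanation is to the "
    "expert's explanation of the same line of code, on a scale from 0 to 1."
)


def format_pair(student: str, expert: str) -> str:
    """Render one expert/student pair in a fixed layout."""
    return f"Expert explanation: {expert}\nStudent explanation: {student}"


def build_prompt(
    student: str,
    expert: str,
    strategy: str = "zero-shot",
    examples: Sequence[Example] = (),
) -> str:
    """Build a scoring prompt for one of three (assumed) strategies."""
    parts = [TASK]
    if strategy == "few-shot":
        # Few-shot: prepend already-scored pairs as demonstrations.
        for ex_student, ex_expert, score in examples:
            parts.append(format_pair(ex_student, ex_expert) + f"\nScore: {score}")
    elif strategy == "chain-of-thought":
        # Chain-of-thought: ask for reasoning before the final score.
        parts.append(
            "Think step by step about what each explanation says "
            "before giving the score."
        )
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy}")
    # The pair to be scored always comes last, with the score left open.
    parts.append(format_pair(student, expert) + "\nScore:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "adds one to i",
        "increments the loop counter i by 1",
        strategy="few-shot",
        examples=[("prints x", "outputs the value of x to the console", 0.9)],
    ))
```

The returned string would then be sent to whichever LLM is being evaluated; the call to the model is deliberately left out because it depends on the provider.
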
### Method Overview

- **Encoder model assessment**: Models such as BERTScore, Universal Sentence Encoder (USE), and SentenceBERT are used to compute the semantic similarity between students' and experts' explanations.
- **LLM assessment**: LLMs are prompted to score the similarity between explanations under different prompting strategies: zero-shot, few-shot, and chain-of-thought prompts.

### Experimental Results

- **Encoder models**: The fine-tuned SentenceTransformer models (such as all-mpnet) perform best at assessing students' answers.
- **LLMs**: Different versions of ChatGPT perform well at assessing semantic similarity, especially with a 0-1 scoring scale and chain-of-thought prompts. GPT-4 stands out, approaching or even exceeding the fine-tuned encoder model.

### Conclusion

The paper shows that LLMs have great potential for automatically assessing students' natural-language explanations of code, especially with appropriate prompting strategies. Future work will further optimize the performance of these models and explore more application scenarios.
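
The encoder-based comparison in the Method Overview comes down to embedding both explanations and measuring how close the two vectors are. The sketch below illustrates that pipeline with cosine similarity. It makes two loud assumptions: `embed` is a toy bag-of-words stand-in for a real sentence encoder such as SentenceBERT, and the `threshold` of 0.5 in `is_correct` is an arbitrary illustrative value, not one reported in the paper.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts.

    A real system would replace this with dense embeddings from a model
    such as SentenceBERT; only the comparison logic below would stay.
    """
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty explanation is similar to nothing
    return dot / (norm_a * norm_b)


def is_correct(student: str, expert: str, threshold: float = 0.5) -> bool:
    """Label a student explanation correct if it is close enough to the expert's.

    The threshold is a hypothetical value chosen for illustration.
    """
    return cosine_similarity(embed(student), embed(expert)) >= threshold


if __name__ == "__main__":
    expert = "increments the loop counter i by one"
    student = "adds one to the counter i"
    print(cosine_similarity(embed(student), embed(expert)))
    print(is_correct(student, expert))
```

With word counts, two paraphrases that share no vocabulary score 0, which is exactly the weakness that learned sentence embeddings are meant to address.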