Abstract: Assessing students' answers, and in particular natural language answers, is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as Large Language Models (LLMs), have led to significant progress in various natural language tasks. Nevertheless, amidst the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs in the realm of automated answer assessment has not received much attention. To address this gap, we explore the potential of using LLMs for automated assessment of students' short and open-ended answers. In particular, we use LLMs to compare students' explanations with expert explanations in the context of line-by-line explanations of computer programs.

What problem does this paper attempt to address?

The problem this paper addresses is the automatic assessment of students' natural-language explanations of code in an educational setting. Specifically, the authors explore the use of large language models (LLMs) to automatically assess students' short, open-ended answers, especially self-explanations in programming. The assessment compares students' line-by-line explanations of computer programs with expert explanations to determine whether the students' understanding is correct.

### Background and Motivation

In education, and in computer science education in particular, assessing students' natural-language answers is an important challenge. In recent years, advances in machine learning, especially Transformer-based models such as large language models (LLMs), have brought significant progress on a range of natural-language tasks. However, there is relatively little research on how to use these models for automatic answer assessment. This paper therefore aims to fill that gap and explore the potential of LLMs for automatically assessing students' code understanding.

### Main Contributions

1. **Explore the application of LLMs in automatic assessment**: The authors study how well LLMs assess students' natural-language explanations of code and compare them with traditional encoder models.
2. **Propose multiple assessment strategies**: Evaluate the performance of LLMs under different prompt strategies (such as zero-shot, few-shot, and chain-of-thought prompts).
3. **Dataset and experimental design**: Use a dataset containing students' and experts' explanations of code snippets, and conduct detailed experiments and analyses across multiple models and methods.
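The prompt strategies named in the contributions can be illustrated with a small sketch. The template wording, the function name `build_prompt`, and the example format below are all hypothetical, since this summary does not reproduce the paper's actual prompts; only the three strategy names and the 0-1 scoring scale come from the summary itself.

```python
from typing import List, Optional, Tuple

# Hypothetical templates: the paper's real prompt wording is not given here.
TASK = ("Rate how semantically similar the student explanation is to the "
        "expert explanation on a scale from 0 to 1.")
COT_HINT = ("Think step by step about what each explanation says "
            "before giving the final score.")


def build_prompt(
    student: str,
    expert: str,
    strategy: str = "zero-shot",
    examples: Optional[List[Tuple[str, str, float]]] = None,
) -> str:
    """Build a similarity-scoring prompt for one student/expert pair."""
    parts = [TASK]
    if strategy == "few-shot":
        # Few-shot: prepend already-scored (student, expert, score) examples.
        if not examples:
            raise ValueError("few-shot prompting needs at least one example")
        for ex_student, ex_expert, ex_score in examples:
            parts.append(
                f"Expert: {ex_expert}\nStudent: {ex_student}\nScore: {ex_score}"
            )
    elif strategy == "chain-of-thought":
        # Chain-of-thought: ask the model to reason before scoring.
        parts.append(COT_HINT)
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy}")
    # The pair to be scored always comes last, with the score left open.
    parts.append(f"Expert: {expert}\nStudent: {student}\nScore:")
    return "\n\n".join(parts)
```

The returned string would then be sent to whichever LLM is being evaluated; the call to the model itself is left out because it depends on the provider's API.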
### Method Overview

- **Encoder model assessment**: Use models such as BERTScore, Universal Sentence Encoder (USE), and SentenceBERT to calculate the semantic similarity between students' and experts' explanations.
- **LLM assessment**: Evaluate the performance of LLMs under zero-shot, few-shot, and chain-of-thought prompt strategies, including simple similarity-scoring prompts.

### Experimental Results

- **Encoder models**: Fine-tuned SentenceTransformer models (such as all-mpnet) perform best in assessing students' answers.
- **LLMs**: Different versions of ChatGPT perform well in assessing semantic similarity, especially when using a 0-1 scoring scale and chain-of-thought prompts. GPT-4 performs particularly well, approaching or even exceeding the fine-tuned encoder model.

### Conclusion

The research in this paper shows that large language models (LLMs) have great potential for automatically assessing students' natural-language explanations of code, especially when appropriate prompt strategies are used. Future work will further optimize the performance of these models and explore more application scenarios.
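The encoder-based comparison from the Method Overview comes down to embedding both explanations and measuring how close the vectors are. The sketch below shows that idea with a toy bag-of-words `embed` function standing in for a real sentence encoder such as SentenceBERT. The function names and the `threshold` value are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty explanation is treated as dissimilar
    return dot / (norm_a * norm_b)


def is_correct(student: str, expert: str, threshold: float = 0.5) -> bool:
    """Mark the student explanation as correct if it is similar enough.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return cosine_similarity(embed(student), embed(expert)) >= threshold
```

With a real encoder, only `embed` would change: it would return a dense vector, and the cosine computation would run over that vector instead of word counts.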

Automated Assessment of Students' Code Comprehension using LLMs

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

Machine Vs Machine: Large Language Models (LLMs) in Applied Machine Learning High-Stakes Open-Book Exams

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

A Large Language Model Approach to Educational Survey Feedback Analysis

Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

Analyzing Large Language Models for Classroom Discussion Assessment

Using Large Language Models for Automated Grading of Student Writing about Science

Can Large Language Models Automatically Score Proficiency of Written Essays?

An Automated Explainable Educational Assessment System Built on LLMs

Evaluating Language Models for Generating and Judging Programming Feedback

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

TAMIGO: Empowering Teaching Assistants using LLM-assisted viva and code assessment in an Advanced Computing Class

Large Language Models in Computer Science Education: A Systematic Literature Review

Comparing Code Explanations Created by Students and Large Language Models