Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

Sami Baral, Eamon Worden, Wen-Chiang Lim, Zhuang Luo, Christopher Santorelli, Ashish Gurung, Neil Heffernan

2024-10-30

Abstract: The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Prior research has explored various methodologies to enhance the effectiveness of feedback. Recent developments in Large Language Models (LLMs) have extended their utility in enhancing automated feedback systems. This study aims to explore the potential of LLMs in facilitating automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing three different models: Llama, SBERT-Canberra, and GPT-4. The evaluation requires each model to provide both a quantitative score and qualitative feedback on the student's responses to open-ended math problems. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model, while the GPT-4 model used a zero-shot learning approach. We evaluate the models' performance in scoring accuracy and the quality of feedback by utilizing judgments from two teachers. The teachers used a shared rubric in assessing the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of the model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.

Computers and Society, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how large language models (LLMs) can provide automated grading and feedback in mathematics education. Specifically, the researchers compare three models (the Llama variant Mistral, SBERT-Canberra, and GPT-4) on evaluating students' answers to open-ended questions. Each model must give both a quantitative score and qualitative feedback. The study evaluates the models by comparing their output with the grading and feedback provided by teachers. It also explores the potential of these automated feedback systems to improve learning outcomes and their possible use in future personalized learning experiences.
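The text above names SBERT-Canberra without describing it. As a rough illustration only, the sketch below assumes the method embeds each response as a vector and uses Canberra distance to retrieve the closest previously graded response, reusing that response's score and feedback. The toy 3-dimensional vectors, the score values, the feedback strings, and the helper names `canberra` and `nearest_scored_response` are all hypothetical; a real system would obtain embeddings from a sentence-embedding model.

```python
def canberra(u, v):
    """Canberra distance: sum of |u_i - v_i| / (|u_i| + |v_i|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:  # a term with both components zero contributes nothing
            total += abs(a - b) / denom
    return total


def nearest_scored_response(query_vec, scored_bank):
    """Return (score, feedback) of the bank entry closest to query_vec."""
    best = min(scored_bank, key=lambda item: canberra(query_vec, item["embedding"]))
    return best["score"], best["feedback"]


# Hypothetical bank of already graded responses; the toy vectors stand in
# for real sentence embeddings, and the scores and feedback are made up.
bank = [
    {"embedding": [0.9, 0.1, 0.0], "score": 4, "feedback": "Correct; clear reasoning."},
    {"embedding": [0.1, 0.8, 0.3], "score": 1, "feedback": "Check how you set up the ratio."},
]

# A new, ungraded response (again a made-up vector) inherits the score and
# feedback of its nearest neighbour in the bank.
score, feedback = nearest_scored_response([0.8, 0.2, 0.1], bank)
print(score, feedback)
```

A retrieval approach like this can only return feedback that a teacher has already written for a similar response, which is one way it differs from the generative models in the comparison.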

Improving the Validity of Automatically Generated Feedback via Reinforcement Learning

Can Large Language Models Replicate ITS Feedback on Open-Ended Math Questions?

Evaluating Language Models for Generating and Judging Programming Feedback

A large language model-assisted education tool to provide feedback on open-ended responses

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

A Large Language Model Approach to Educational Survey Feedback Analysis

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Investigating Automatic Scoring and Feedback using Large Language Models

Evaluating and Optimizing Educational Content with Large Language Model Judgments

Leveraging large language models to construct feedback from medical multiple-choice Questions

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Automated Assessment of Students' Code Comprehension using LLMs

MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Enhancing LLM-Based Feedback: Insights from Intelligent Tutoring Systems and the Learning Sciences

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

On the Opportunities of Large Language Models for Programming Process Data

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests

Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education

Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback